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Abstract 


The  best  serial  algorithms  for  numerically  approximating  the  solution  of  a  linear  partial 
differential  equation  (PDE)  exploit  knowledge  of  the  solution  operator.  This  dissertation 
describes  how  the  solution  operator  also  influences  the  behavior  of  parallel  algorithms. 

Approximating  the  solution  at  a  single  location  in  the  problem  domain  is  considered. 
We  derive-a  lower  bound  on  the  error  in  the  approximation  ^that  is  a  function  of  the 
amount  of  data  used  and  the  smoothness  of  the  data  functions.  From  this  derive  a 
lower  bound  on  the  parallel  complexity  of  algorMpis  that  approximate  the  solution.  The 
lower  bound  is  a  linear  function  of  loge^1,  where  c is  an  upper  bound  on  the  error.  Thus 
the  parallel  complexity  increases  as  ^decreases,  independent  of  the  number  of  processors 
used.  We-artSo^bnsfruct  $n  algorithm  ^yvhose  parallel  complexity  is  of  this  form,  proving 

*'  "f.O  '  - 

that  the  form  of  the  bound  is  the  best  possible. 

The  execution  time  of  parallel  algorithms  is  a  function  of  both  the  communication 
costs  and  the  parallel  complexity.  We  describe  bounds  on  the  communication  costs  of 
parallel  algorithms  that  are  functions  of  the  distance  between  collaborating  processors.  If 
the  interconnection  network  of  a  multiprocessor  is  a  d  dimensional  grid,  then  we  prove 
that  the  execution  time  of  algorithms  that  approximate  the  solution  is  bounded  from 
below  by  a  linear  function  of  e_3+i\  where  7  is  a  positive  constant  determined  by  the 
smoothness  of  the  data  functions.  Thus,  for  small  e,  the  communication  costs  are  the 
dominant  constraints  on  the  optimal  performance.  *  <  ■ 
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Chapter  1 
Introduction 


The  best  serial  algorithms  for  numerically  approximating  the  solution  of  a  partial  differ¬ 
ential  equation  (PDE)  exploit  knowledge  of  the  behavior  of  the  solution  of  the  PDE. 
The  design  of  parallel  algorithms  is  a  more  complicated  process  than  the  design  of  se¬ 
rial  algorithms,  and  the  performance  of  a  parallel  algorithm  is  strongly  influenced  by  the 
architecture  of  the  multiprocessor  used.  This  work  describes  ways  in  which  the  data  de¬ 
pendence  of  the  solution  of  a  well-posed  linear  scalar  PDE  determines  bounds  on  the 
effectiveness  of  parallel  algorithms  and  architectures. 

We  assume  that  a  Green's  function  of  a  particular  form  exists  for  the  PDE.  We  assume 
that  the  numerical  methods  use  values  of  the  data  functions  sampled  at  locations  in  their 
respective  domains,  and  calculate  values  of  the  solution  at  locations  in  its  domain.  We 
assume  that  the  data  functions  satisfy  conditions  suitable  for  bounding,  a  priori,  the  error 
of  an  mth  order  accurate  approximation  to  the  PDE,  for  some  fixed  positive  integer  m. 
We  assume  that  the  error  introduced  by  a  numerical  algorithm  is  required  to  be  less  than 
some  given  tolerance  e  for  each  solution  value.  These  assumptions  are  described  in  more 
detail  later  in  this  chapter.  Given  these  assumptions,  we  describe  the  following  result: 

•  We  calculate  a  lower  bound  on  the  parallel  complexity  of  any  parallel  algorithm  that 
satisfies  the  above  assumptions.  This  lower  bound  is  a  linear  function  of  logjt'1. 
Therefore,  the  execution  time  required  to  approximate  the  solution  of  the  PDE  must 
grow  if  the  error  bound  decreases,  no  matter  how  many  processors  are  available.  We 
also  construct  an  algorithm  whose  parallel  complexity  has  the  same  form,  proving 
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that  the  form  of  the  lower  bound  is  the  best  possible.  Finally,  we  show  that  the 
time  spent  moving  data  between  processors  can  constrain  the  execution  time  even 
more  strongly  than  the  lower  bound  on  the  parallel  complexity.  For  example,  if 
the  interconnection  network  of  a  multiprocessor  is  a  d  dimensional  mesh,  then  the 
execution  time  is  bounded  from  below  by  a  negative  power  of  t,  independent  of  the 
number  of  processors  that  are  available. 

The  dissertation  is  organized  in  the  following  manner.  The  rest  of  this  chapter  describes 
our  models  for  multiprocessor  architectures,  parallel  algorithms,  and  numerical  algorithms 
for  approximating  solutions  to  linear  PDEs.  Chapter  2  describes  the  data  dependency 
analysis,  and  properties  of  the  results  that  will  be  used  in  later  chapters.  Chapter  3 
describes  lower  bounds  on  the  execution  time  of  parallel  algorithms  as  the  problem  and 
the  multiprocessor  grow  in  size.  Chapter  4  describes  the  bounds  derived  in  the  two 
previous  chapters  for  example  problems,  and  discusses  generalizing  the  assumptions  so  as 
to  include  a  larger  class  of  problems.  Chapter  5  summarizes  the  work  presented  here,  and 
briefly  discusses  some  generalizations  of  the  results.  Appendices  contain  details  left  out 
of  the  main  presentation. 

1.1  Multiprocessor  Architectures 

We  are  primarily  interested  in  MIMD1  multiprocessors,  and  in  modelling  parallelism  at 
the  level  of  concurrent  execution  of  floating  point  operations.  This  bias  is  reflected 
in  the  following  multiprocessor  model.  Most  of  our  results  apply  to  a  larger  class  of 
multiprocessors  and  parallel  algorithms,  for  an  appropriate  interpretation. 

1.1.1  Multiprocessor  Model 

Each  multiprocessor  is  made  up  of  memory,  processors,  and  unidirectional  communication 
channels.  The  multiprocessor  is  represented  by  a  labelled  directed  graph  2,  (V,E).  Each 

'Multiple  /nstruction  Multiple  Data  is  one  category  of  Flynn’s  multiprocessor  taxonomy  [Fly72]. 
If  a  multiprocessor  is  in  MIMD,  then  the  instruction  a  processor  is  executing  and  the  data  it  is  using 
can  be  different  from  those  of  other  processors  at  any  given  moment. 

2 See  p.5Q  of  [A HU 74] 
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vertex,  €  V,  of  the  graph  is  either  a  memory  or  a  processor.  Each  edge,  e3  £  E,  is  a 
unidirectional  communication  channel  between  two  vertices,  permitting  the  source  vertex 
to  send  data  to  the  destination  vertex.  The  subscript  of  the  source  vertex  of  edge  e,  is 
s(j).  The  subscript  of  the  destination  vertex  of  edge  e3  is  d{j ).  A  vertex  is  connected  to 
a  vertex  v2  if  there  exists  an  edge  for  which  uj  is  the  source  and  v2  is  the  destination.  The 
multiprocessor  uses  a  fixed  length  floating  point  number  representation  for  real  numbers. 
This  representation  is  assumed  to  be  sufficient  to  approximate  the  solution  of  a  given 
PDE  to  within  some  given  error  tolerance.  The  components  of  the  multiprocessor  are 
characterized  by  their  abilities  to  manipulate  these  floating  point  numbers. 

Each  memory  vertex  V{  is  labelled  by  two  nonnegative  numbers.  The  capacity  c,  is  the 
number  of  floating  point  numbers  the  memory  can  hold.  The  access  time  a,  is  the  time 
it  takes  to  recall  a  floating  point  number  from  the  memory. 

Each  processor  is  a  serial  processor,  which  we  define  to  be  a  processor  that  calculates 
floating  point  operations  sequentially.  Each  processor  vertex  is  labelled  by  a  positive 
real  number  indicating  the  time  it  takes  to  add  two  floating  point  numbers  together, 
/,,(+).  This  is  used  as  the  standard  unit  of  computation.  All  floating  point  operations  are 
computed  by  the  composition  of  operators  from  some  given  set  of  binary  and  unary  floating 
point  operators.  We  assume  that  negation  is  the  only  unary  floating  point  operation  among 
this  set  of  primitives.  We  also  assume  that  the  execution  time  of  any  binary  floating  point 
operation  is  greater  than  or  equal  to  the  execution  time  of  a  binary  floating  point  addition. 

Each  edge  e2  is  labelled  by  two  positive  numbers.  The  bandwidth  b,  is  the  number  of 
floating  point  numbers  that  can  be  transmitted  over  the  channel  during  some  given  unit 
of  time  t.  The  transmission  time  t}  is  the  time  it  takes  to  send  a  single  floating  point 
number  from  v,^)  to  Vd(j)  using  communication  channel  e;.  These  values  may  include 
the  overhead  of  starting  a  transmission  between  the  vertices,  and  so  bj  may  not  be  r/tj. 
The  values  will  depend  on  the  communication  protocol. 

Many  of  the  assumptions  in  this  model  are  merely  to  establish  a  concrete  model, 
and  are  easily  generalized.  For  example,  the  restriction  to  binary  primitive  floating  point 
operators  is  a  reasonable  assumption,  but  we  only  want  to  establish  an  upper  bound  on 
the  number  of  operands  of  the  primitive  floating  point  operators. 
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1.1.2  Communication  Capabilities 

The  performance  of  a  parallel  algorithm  is  strongly  affected  by  the  ability  of  the  multi¬ 
processor  to  move  data  between  vertices.  Define  a  path  P  to  be  a  sequence  of  edges, 

|  /  =  1,.  such  that  the  destination  vertex  of  edge  en  is  the  source  vertex  for 

edge  eJI+1.  Define  the  length  of  the  path  P,  L( P),  to  be  the  sum  of  the  transmission 
times  along  this  path, 

i(P)  =  E<»  • 

i=i 

L(P)  is  then  the  time  required  to  send  a  single  floating  point  number  along  the  path. 

Define  the  distance  from  vertex  vj  to  vertex  v2l  D(v1,v2),  to  be  the  length  of  the  path 
of  minimum  length  starting  at  and  ending  at  v2.  Define  D{v,v)  to  be  zero.  D(vi,v2) 
is  the  minimum  amount  of  time  it  takes  to  send  a  single  floating  point  number  from  v\ 

to  v2. 

The  diameter  of  a  subset  of  the  vertices  of  the  graph,  V'  C  V,  is  the  maximum 
distance  between  two  vertices  of  the  subset, 

diam(V')  =  max  D(v,w )  . 

VtiuEV' 

A  center  of  this  subset  is  a  vertex  that  minimizes  the  maximum  distance  between  itself 
and  other  vertices  in  the  subset, 

C  =  {c  I  c  e  V",  max  D(c,  w)  =  min  max  D( i>,  u;)}  . 

The  radius  of  the  subset  is  this  distance, 


rad(V')  =  min  max  D(v,  w) 


The  radius  satisfies 

diam(V')/2  <  rad(V')  <  diam(V')  . 

If  the  vertices  in  a  set  V'  are  all  collaborating  in  a  calculation,  then  the  radius  is  a  lower 
bound  on  the  time  spent  moving  data  between  vertices  during  the  calculation. 


Example:  Assume  that  a  calculation  produces  one  floating  point  number, 
that  one  vertex  of  a  subset  V ’  has  been  designated  to  hold  this  result,  and  that 
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each  vertex  in  V'  is  in  sole  possession  of  information  crucial  to  the  calculation. 

Then  a  lower  bound  on  the  time  spent  in  moving  this  information  through 
the  communication  channels  is  rad(V').  The  vertex  that  has  been  designated 
to  hold  the  result  cannot  be  closer  than  rad(V')  to  all  of  the  other  vertices 
of  the  set,  and  some  datum  used  by  the  result  will  take  at  least  this  long  to 
arrive  at  the  designated  vertex. 

1.1.3  Example  Architectures 

Most  multiprocessor  architectures  currently  being  investigated  have  fairly  simple  graphs, 
with  essentially  homogeneous  processor  and  memory  capabilities.  See  Feng  [Fen81]  for 
a  survey  of  some  of  the  graphs  that  have  been  considered.  The  following  examples  are 
common  designs,  each  of  whose  behavior  is  representative  of  a  class  of  architectures.  All 
of  the  examples  can  be  described  as  undirected  graphs.  If  an  edge  exists  from  y,  to  v}, 
then  an  edge  with  the  same  parameters  also  exists  from  to  y,.  We  will  refer  to  this 
pair  as  a  single  edge  in  the  following  discussion.  Additionally,  all  processor,  memory,  and 
communication  channel  capabilities  are  the  same  unless  otherwise  noted.  The  values  for 
the  capacity,  the  access  time,  the  floating  point  addition  time,  the  bandwidth,  and  the 
transmission  time  are  denoted  by  c,  a,  /(+),  6,  and  t  respectively. 

•  Fully  Connected.  The  most  powerful  communications  network  has  a  graph  that  is 
a  clique.  That  is,  every  vertex  is  connected  to  every  other  vertex.  The  diameter  of 
any  subset  of  the  multiprocessor  is  t. 

•  Star.  One  multiprocessor  architecture  suitable  for  small  numbers  of  processors  can 
be  modelled  as  N  processors  connected  to  a  single  memory  in  a  star  topology.  That 
is,  there  are  N  communication  channels,  and  each  channel  connects  a  different 
processor  with  the  memory.  The  diameter  of  any  subset  of  the  multiprocessor 
containing  at  least  two  processors  is  2 1. 

•  Distributed  Shared  Memory.  Both  of  the  previous  examples  have  the  property  that 
all  memories  are  equidistant  from  all  processors.  We  will  call  an  architecture  with 
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this  property  a  shared  memory  architecture.3  While  the  two  previous  examples 
satisfy  this  condition,  they  are  expensive  to  implement  for  large  numbers  of  proces¬ 
sors.  A  less  expensive  shared  memory  architecture  replaces  the  single  large  memory 
in  the  previous  examples  with  many  smaller  memories  and  a  distinct  interconnec¬ 
tion  network.  The  vertices  of  the  network  are  themselves  small  capacity  memories, 
switches,  that  accept  and  retransmit  data.  Each  processor  and  memory  external 
to  this  interconnection  network  has  one  edge  into  the  network.  The  interconnec¬ 
tion  network  provides  paths  from  any  processor  to  any  external  memory.  Assume 
that  there  are  N  processors  and  N  external  memories,  where  N  =  2k  for  some 
positive  integer  k.  If  the  interconnection  network  is  implemented  as  an  Omega 
network  [HB84]  [GGK*83],  then  there  are  .5-  Nlog2N  switches,  each  of  which  has 
4  impinging  edges.  The  distance  between  any  processor  and  external  memory  is 
Mog2  N.  If  all  communication  between  processors  is  via  external  memory,  then  the 
diameter  of  a  subset  of  vertices  containing  at  least  two  processors  is  2 1  ■  log2  N.  If 
interprocessor  communication  is  direct,  then  a  subset  of  K  processors  can  have  a 
diameter  as  small  as  t  ■  log2  K. 

•  d  Dimensional  Array.  We  call  an  architecture  a  local  memory  architecture  if  each 
memory  is  connected  to  only  one  vertex,  a  processor,  and  each  processor  is  con¬ 
nected  to  only  one  memory.  Thus,  each  processor  has  quick  access  to  its  local 
memory,  but  at  the  cost  of  slower  access  to  the  others.  A  common  example  is  a 
d  dimensional  array  of  processor- memory  pairs.  Each  processor  is  connected  to  up 
to  2d  other  processors,  forming  a  d  dimensional  array.  Assume  that  the  array  has 
equal  length  sides  and  N  processor- memory  pairs.  Then  the  diameter  of  a  subset 
of  K  processors  is  no  more  than  dt  ■  (N1^  —  1),  and  no  less  than  dt  ■  (Kl^d  —  1). 
The  maximum  is  the  diameter  of  the  multiprocessor.  The  minimum  is  achieved  by 
a  subset  of  vertices  and  edges  that  is  a  d  dimensional  array  with  K1^  processors 
on  a  side. 

3Currently,  the  most  common  shared  memory  multiprocessor  architectures  can  be  described  as  N 
processors  and  a  single  memory  connected  by  a  shared  bus.  The  graph  of  this  particular  architecture 
can't  be  represented  using  the  above  model,  but  the  communication  capabilities  of  this  network  can 
be  considered  to  fall  somewhere  between  that  of  the  fully  connected  network  and  that  of  the  star 
network. 
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•  Hypercube.  If  the  dimension  of  an  array  of  N  processors  with  equal  length  sides  is 
log2  N ,  then  the  graph  is  of  a  log2  N  dimensional  binary  hypercube4.  Each  processor 
has  log2  N  edges.  The  diameter  of  a  subset  of  K  processors  is  between  t-log2  N  and 
t  ■  log2/\  .  The  lower  bound  is  associated  with  a  subset  of  vertices  and  edges  that 
approximates  a  log2  K  dimensional  hypercube,  or  whose  complement  approximates 
a  log2(./V  —  K)  dimensional  hypercube. 

The  difficulty  and  cost  of  constructing  multiprocessors  based  on  these  designs  can  vary 
dramatically.  As  the  number  of  processors  grows,  more  and  more  of  the  examples  become 
infeasible.  Ultimately,  the  three  dimensionality  of  the  physical  world  will  limit  feasible 
architectures  to  something  that  can  be  embedded  in  some  variant  of  a  three  dimensional 
grid  of  processors.  Both  technological  limits  and  basic  physical  laws  constrain  how  small 
the  processors  and  memories  can  be  [MC80],  and  the  volume  taken  up  by  a  multiprocessor 
must  increase  as  the  number  of  devices  increase.5  Thus,  the  speed  of  light  will  constrain 
the  speed  of  communications  between  the  devices  in  the  multiprocessor  [Kro85].  But 
these  limitations  are  not  the  only  difficulties  associated  with  these  architectures.  For 
the  distributed  shared  memory  architecture  example,  increasing  the  number  of  processors 
increases  the  minimum  number  of  edges  data  must  travel  over  when  being  sent  between 
any  processor  and  external  memory.  This  necessarily  increases  the  communication  time 
when  t  is  kept  fixed.  For  fully  connected,  star,  and  hypercube  based  architectures,  the 
number  of  input/output  ports  on  a  single  device  grows  as  the  number  of  processors 
increase.  This  increases  device  size,  cost,  and  complexity. 


1.2  Parallel  Algorithms 

We  model  an  algorithm  as  a  partially  ordered  set  of  instructions  of  the  form 

y  =  flop(i i,. . . , xn)  . 


4See  Seitz  [Sei85]  for  a  description  of  a  particular  hypercube  based  multiprocessor  architecture. 
5Similar  arguments  constrain  how  small  f(+)  can  be. 
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Hop  is  a  floating  point  operation,  y  is  a  floating  point  variable,  and  are 

floating  point  constants  and  variables.  If  a  floating  point  variable  is  used  by  two  differ¬ 
ent  instructions,  and  if  one  of  the  instructions  changes  the  value  of  that  variable,  then 
the  partial  order  specifies  a  precedence  relationship  between  them.  These  are  the  only 
relationships  established  by  the  partial  order.6 

This  model  of  an  algorithm  ignores  many  of  the  details  usually  found  in  algorithms.  In 
particular,  integer  arithmetic  and  instructions  controlling  conditional  execution  are  not  rep¬ 
resented.  But  the  time  spent  executing  floating  point  operations  generally  dominates  the 
total  execution  time  of  algorithms  for  numerically  approximating  the  solution  of  PDEs. 
Moreover,  this  model  is  sufficient  for  establishing  the  lower  bounds  described  in  sec¬ 
tion  1.2.2. 

We  define  the  serial  computational  complexity  of  an  algorithm,  C3,  to  be  the  time 
spent  calculating  the  floating  point  operations  on  some  standard  serial  processor.  We 
will  often  simply  refer  to  this  as  the  serial  complexity.  See  [Kro85]  and  [AHU74]  for 
other  definitions  of  serial  complexity.  The  standard  processor  is  assumed  to  satisfy  the 
assumptions  made  in  the  previous  section  about  the  processors  in  the  multiprocessor. 
All  sequential  orderings  of  the  instructions  of  an  algorithm  that  are  consistent  with  the 
partial  ordering  will  have  the  same  serial  complexity,  and  will  produce  the  same  results 
when  executed  on  a  serial  processor.  Therefore,  we  will  also  refer  to  the  partially  ordered 
set  of  instructions  as  a  serial  algorithm.  By  the  assumptions  made  on  the  processor, 
the  serial  complexity  is  a  weighted  sum  of  the  number  of  floating  point  operations.  The 
weights  depend  on  the  specifics  of  the  standard  processor. 

A  parallel  implementation  of  an  algorithm  on  a  multiprocessor  specifies  when  and  on 
which  processor  each  instruction  is  executed,  where  the  data  is  initially  assigned,  where 
each  intermediate  and  final  result  is  stored,  and  what  communication  takes  place  during 
the  execution  of  the  algorithm.  We  will  refer  to  this  information  as  the  scheduling  of 
the  algorithm.  A  scheduling  is  well-defined  if  it  is  compatible  with  the  partial  order's 
precedence  relationships,  and  if  all  demands  made  on  the  processors,  memories,  and  com¬ 
munication  channels  are  within  their  capabilities.  We  define  a  parallel  algorithm  to  be 


6Thus,  the  algorithm  can  be  represented  by  a  data  flow  graph.  See  (HB84,  pages  740-744]. 
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the  union  of  a  serial  algorithm,  a  multiprocessor  architecture,  and  a  well-defined  schedul¬ 
ing.  Thus,  we  associate  a  deterministic  serial  algorithm  with  each  parallel  algorithm.  In 
practice,  the  serial  algorithm  may  be  a  function  of  the  data.  This  can  make  determining 
the  serial  algorithm  difficult,  especially  for  chaotic  parallel  algorithms  [CM69]  [Bau78]. 
But  we  can  still  analyze  characteristics  of  the  parallel  algorithm  by  establishing  necessary 
properties  of  the  associated  serial  algorithm. 

1.2.1  Parallel  Algorithm  Costs 

We  define  the  cost,  Tp,  to  be  the  time  it  takes  to  execute  a  parallel  algorithm  on  a  multipro¬ 
cessor.  Unlike  serial  algorithms,  the  cost  of  a  parallel  algorithm  is  not  well  approximated  by 
a  weighted  sum  of  the  number  of  floating  point  operations.  Instead,  there  are  two  distinct 
costs  associated  with  an  efficient  parallel  algorithm,  parallel  computational  complexity 
and  communication  cost. 

Parallel  computational  complexity 

The  parallel  computational  complexity  is  the  amount  of  time  during  which  at  least  one 
of  the  processors  is  busy  calculating  the  instructions  of  the  serial  algorithm.  We  will  also 
refer  to  the  parallel  computational  complexity  as  simply  the  parallel  complexity,  or  the 
parallel  computation  cost.  See  [Kro85]  for  other  definitions  of  parallel  complexity.  If  the 
multiprocessor  has  N  identical  processors,  then  the  parallel  complexity,  Cp,  is  bounded 
from  below  by7 


The  serial  complexity  in  this  expression  is  based  on  using  one  of  the  N  processors  as  the 
standard  serial  processor.  For  this  bound  to  be  achieved,  all  of  the  processors  must  be 
busy  all  of  the  time.  Even  if  communication  is  free,  i.e.  a  =  0,  t  =  0,  and  b  =  oo,  few 
algorithms  have  N  independent  operations  to  be  calculated  throughout  their  execution, 
The  partial  order  can  be  examined  to  determine  the  maximum  number  of  independent 
operations  that  can  be  calculated  at  any  one  point  in  the  algorithm.  The  schedule  may 
further  limit  the  parallelism. 

7fz]  represents  the  smallest  integer  greater  than  or  equal  to  x. 
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Measures  of  the  parallelism  in  the  parallel  algorithm  are  the  computational  speed-up, 


and  the  computational  efficiency, 


The  computational  speed-up  represents  how  much  faster  the  parallel  algorithm  can  be 
executed  than  the  serial  algorithm  when  communication  is  free.  It  is  always  less  than  or 
equal  to  N.  If  all  the  processors  are  identical,  then  the  computational  efficiency  measures 
the  average  amount  of  time  each  of  the  processors  is  busy  computing.  It  is  always  less 
than  or  equal  to  1. 

Example:  Consider  a  multiprocessor  with  N/ 2  processors,  each  of  which 
calculates  a  floating  point  addition  in  time  /(+).  Then  the  summation  of  N 
floating  point  numbers, 

j v 

i=i 

requires  at  least  time  /(+)  •  [log2  ^1  ■  Add  is  a  binary  operation,  and  each  time 
step  of  length  /(+)  replaces  M  existing  summands  by  at  least  M/2  results. 

These  results  are  the  summands  for  the  next  step.8 

The  best  serial  algorithm  has  a  serial  complexity  of  /(+)  ■  (N  —  1)  on 
one  of  these  processors.  Thus,  the  computational  speed-up  is  bounded  from 
above  by  N/\\og2N]  for  a  parallel  algorithm  based  on  this  serial  algorithm. 

The  computational  efficiency  is  bounded  from  above  by  2/  log2  N  when  N/2 
processors  are  used.  The  poor  computational  efficiency  reflects  the  fact  that, 
after  each  step,  half  of  the  processors  active  previously  have  no  more  work  to 
do.  If  fewer  processors  are  used,  then  the  computational  efficiency  increases. 

But  then  the  computational  speed-up  decreases  and  the  parallel  complexity 
increases. 

8See  Lemma  1  in  Kuck  [Kuc78,  page  95]. 
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Communication  cost 

The  communication  cost,  Wv,  is  the  amount  of  time  during  which  data  is  being  moved 
between  the  processors  and  memories  of  the  multiprocessor.  This  includes  the  time  spent 
accessing  a  memory  location  and  packaging  the  data  for  transmission.  The  achievable 
parallel  complexity  is  constrained  by  the  number  of  processors  in  the  multiprocessor  and  by 
the  serial  algorithm.  The  communication  cost  is  additionally  constrained  by  the  graph  of 
the  multiprocessor.  For  example,  the  maximum  computational  speed-up  in  the  summation 
problem  is  achieved  by  using  Nj 2  processors  for  an  N  term  sum.  Since  each  processor 
produces  an  intermediate  result  that  is  needed  to  compute  the  final  sum,  the  radius  of 
the  N/2  processor  subset  of  the  multiprocessor  will  bound  the  communication  cost  from 
below. 


Effective  scheduling 

Most  other  reasonable  costs,  like  integer  arithmetic  or  overhead  for  synchronization,  are 
easily  included  in  the  model  as  communication  or  parallel  computation  costs.  An  unrea¬ 
sonable  cost  is  incurred  when  the  schedule  unnecessarily  augments  the  partial  order  of  the 
serial  algorithm,  adding  extra  delays.  For  a  given  assignment  of  instructions  to  processors 
and  data  to  memories,  we  define  a  well-defined  schedule  to  be  effective  if: 

•  a  processor  executes  an  instruction  whenever  its  operands  are  available,  the  processor 
is  not  currently  executing  another  instruction  or  communication  request,  and  the 
partial  order  of  the  serial  algorithm  is  not  violated  by  doing  so. 

•  when  a  processor  is  blocked  from  executing  any  more  instructions  due  to  a  lack  of 
data,  and  that  data  exists,  the  data  is  moved  to  the  processor  as  quickly  as  possible. 

We  henceforth  restrict  ourselves  to  parallel  algorithms  with  effective  schedules.  An  effec¬ 
tive  schedule  is  not  necessarily  a  good  one,  but  having  one  ensures  that  some  part  of  the 
multiprocessor  is  actively  computing  or  communicating  at  all  times. 
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Total  cost 

A  lower  bound  on  Tp  is  9 

max{Cp,  Wp}  . 

But  a  processor  cannot  calculate  a  floating  point  operation  until  the  operands  are  available. 
If  the  operands  must  be  drawn  from  some  other  device,  then  the  calculation  of  the 
operation  is  blocked  until  they  arrive.  If  the  schedule  is  effective,  then  an  upper  bound 
on  the  cost  of  the  parallel  algorithm  is 


TP<CP+WP  . 


We  define  the  speed-up  and  the  efficiency  to  be  the  corresponding  measures  for  the  total 
cost  of  a  parallel  algorithm  [Kuc78]  ,  i.e. 


and 


respectively.  N  is  again  the  number  of  processors. 


Computation  bound  algorithms 

A  parallel  algorithm  is  computation  bound  if  the  communication  cost  is  no  more  than 
the  parallel  complexity,  Wp  <  Cp.  If  a  parallel  algorithm  is  computation  bound  for  a 
multiprocessor  architecture,  then  the  architecture  is  adequate  in  the  sense  that  the  partial 
order  of  the  corresponding  serial  algorithm  determines  the  dominant  part  of  the  cost. 
This  condition  is  also  a  useful  tool  for  deciding  how  many  processors  to  use  to  solve  a 
problem.  For  a  given  multiprocessor  architecture,  it  is  common  for  the  parallel  complexity 
to  be  decreasing  and  the  communication  cost  to  be  increasing  as  the  number  of  processors 

9 We  will  use  max{ai, . . .  ,a„}  as  an  alternative  notation  for 

min  a  . 

{»i,  .,««) 

Similar  notation  will  be  used  for  the  minimum  element  in  a  set. 
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used  to  execute  an  algorithm  increases.  This  describes  the  behavior  of  a  set  of  parallel 
algorithms,  one  for  each  number  of  processors,  and  will  only  hold  over  a  range  of  numbers 
of  processors.  The  largest  number  of  processors  for  which  the  corresponding  parallel 
algorithm  is  computation  bound  represents  an  estimate  of  the  number  of  processors  to 
use  to  minimize  the  cost.  If  Wp  =  Cv,  then  the  cost  of  the  corresponding  parallel  algorithm 
is  at  most  twice  that  of  the  optimal,  and  will  usually  be  much  better  than  that.  A  similar 
estimate  can  be  used  to  evaluate  a  type  of  architecture  as  both  the  problem  size  and  the 
number  of  processors  grow. 

Example:  The  summation  problem  can  be  scheduled  so  that  the  parallel 
complexity  achieves  the  minimum  of  /(+)■  [Iog2  N]  by  using  10  [N/ 2J  identical 
processors.  This  performance  is  guaranteed,  modulo  constants,  for  any  pos¬ 
itive  N  if  the  parallel  algorithms  remain  computation  bound.  But  the  radius 
of  a  [JV/2J  processor  linear  array  is  greater  than  or  equal  to  t  •  [JV/4J,  where 
t  is  the  distance  between  connected  processors.  Since  the  radius  represents 
a  lower  bound  on  the  communication  cost  for  this  problem,  the  algorithm 
cannot  be  computation  bound  if 

;V>  4-  ^  •logJlV+  + 

For  the  parallel  algorithm  to  be  computation  bound  for  arbitrary  V,  the  radius 
of  the  jV/2  processors  must  be  bounded  by  /(+>  ■  flog2  .V]  for  all  N.  No 
d  dimensional  array  multiprocessor  will  satisfy  this  condition  for  large  N  if  t 
and  /(+)  are  bounded  away  from  zero. 

1.2.2  Lower  Bounds 

A  serial  algorithm  describes  a  mapping  from  a  set  of  data  to  a  set  of  solution  values 
specified  by  the  problem.  Define  a  nontrivial  solution  value  to  be  one  that  is  the  result  of 
a  binary  floating  point  operation.  Define  a  set  of  solution  values  to  be  independent  if  no 
two  of  them  have  the  same  absolute  values  for  all  possible  data.  Let  the  set  of  nontrivial 

10j*J  represents  the  largest  integer  less  than  or  equal  to  x. 
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independent  solution  values  of  an  algorithm  a  be 

U  =  {u:\j  =  l,...,NaJ  , 

where  Na<u  is  the  number  of  these  solution  values.  For  this  set  to  be  independent,  at 
least  iVo  u  binary  floating  point  operations  are  required  to  generate  its  values.  This  is 
a  consequence  of  our  assumption  that  unary  negation  is  the  only  unary  floating  point 
operation. 

Let  the  data  that  is  required  to  produce  these  solution  values  be 

{$*!*  =  • 

That  is,  there  is  no  gk  in  this  set  that  can  be  arbitrarily  changed  without  changing  at 
least  one  of  the  solution  values.  Na<g  is  the  number  of  these  data.  For  a  datum  to 
be  required  by  a  nontrivial  solution  value,  some  unary  function  of  its  value  must  be  an 
operand  of  a  binary  floating  point  operation.  Therefore,  at  least  \Na<g/2]  binary  floating 
point  operations  are  required  to  use  all  of  this  data.  Since  addition  is  the  least  expensive 
binary  floating  point  operation,  we  have  proven  the  following  lemma. 


Lemma  1.1  A  lower  bound  on  the  serial  complexity  of  an  algorithm  a  is 


/(+) 


•  max 


For  Uj  (E  U ,  define  /Va(u;)  to  be  the  amount  of  data  that  is  required  by  a  to  compute 
Uj.  Again,  this  means  that  there  is  no  gk  in  this  set  that  can  be  arbitrarily  changed 
without  changing  the  value  of  ur  Each  datum  must  be  the  operand  of  a  binary  floating 
point  operation,  and  this  operation  produces  a  result  that  will  itself  be  the  operand  of 
a  binary  floating  point  operation.  Summing  the  number  of  operations  indicated  by  this 
argument  leads  to  the  result  that  the  serial  complexity  of  calculating  u3  is  bounded  from 
below  by  /(+)  •  (iVa(u;)  -  l).11  The  next  lemma  is  a  direct  consequence  of  this. 

"See  Lemma  1  in  Kuck  [Kuc78,  page  95] 
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Lemma  1.2  A  lower  bound  on  the  serial  complexity  of  an  algorithm  a  is 

/(+)  •  max  (iVa(uj)  -  1)  . 

\ij 


For  the  rest  of  this  section,  assume  that  we  know  a  lower  bound  on  the  serial  complexity 
of  the  form  /(+)  •  C,(Na(uj))  for  all  Uj  6  U,  where  C,(N)  >  N  —  1. 

For  a  parallel  implementation  of  a,  say  that  a  processor  collaborates  in  the  computation 
of  Uj  if  changing  the  results  of  all  of  the  floating  point  operations  calculated  by  that 
processor  can  change  the  value  of  Uj.  Define  Pa[uj)  to  be  the  number  of  processors 
that  collaborate  in  the  calculation  of  u:  for  a  given  parallel  implementation  of  a.  For  a 
given  multiprocessor,  consider  the  subset  of  Pa(uj)  processors  with  minimum  radius.  Let 
r(Pa(uj))  be  this  radius.  Then  the  following  lemma  is  a  consequence  of  the  discussion  in 
the  example  on  page  4. 

Lemma  1.3  For  all  u}  (=  U ,  the  communication  cost  of  a  parallel  implementation  of 
an  algorithm  a  on  a  given  multiprocessor  is  bounded  from  below  by  r{Pa(uj)). 

This  follows  directly  from  the  fact  that  information  is  needed  from  all  Pa[uj)  processors 
in  order  to  calculate  Uj. 

The  parallel  complexity  of  calculating  Uj  on  a  multiprocessor  can  be  no  faster  than  a 
parallel  implementation  of  summing  its  required  data.  The  data  can  only  be  reduced  by 
a  sequence  of  binary  operations,  and  binary  add  is  assumed  to  be  the  fastest  nonunary 
floating  point  operation.  Thus,  one  lower  bound  on  the  parallel  compelxity  is  /(+)  • 
flog2^Va(Uj)],  as  in  the  example  on  page  10.  This  fact  and  inequality  1.1  on  page  9 
prove  the  following  lemma. 


Lemma  1.4  Assume  that  an  algorithm  a  calculates  u3.  Then  the  parallel  complexity 
of  a  parallel  implementation  of  a  on  a  multiprocessor  with  identical  processors  is 
bounded  from  below  by 


/(+)  '  max 


C,(iV„(u,)) 

p.M 


flog2  .Va(  u_,  )1 
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Define 


and 


iVa,,  =  max  Na(uj) 


Pa,„  =  max  Pa(u.)  . 
u,eu 


Theorem  1.1  The  parallel  cost  of  a  parallel  implementation  of  an  algorithm  &  on  a 
multiprocessor  with  identical  multiprocessors  is  bounded  from  below  by 


max  <  r(Pa  „) ,  /(+) 


Cs(Na,.) 

Pa., 


>/(+)•  n°S2  NaA 


Proof:  The  total  cost  is  bounded  from  below  by  max{Cp,  Wp) .  The  result  then  follows 
from  Lemmas  1.3  and  1.4.  I 

Let  r(p)  be  the  radius  of  the  p  processor  subset  of  a  multiprocessor  that  has  the  minimum 
radius. 


Corollary  1.2  For  a  P  processor  multiprocessor  with  identical  processors ,  a  lower 
bound  on  the  parallel  cost  of  any  parallel  implementation  of  a  serial  algorithm  a  is 


min  max 
p€{i . P} 


r(p)>  /(+) 


C,(Na,.) 


if(  +  )  ’  ’ /(  + 


)  ' 


N„ 


•/(+) 


2  P 


Proof:  For  any  parallel  implementation  of  an  algorithm  a  on  this  multiprocessor,  Pa  „  is 
a  member  of  the  set  {1,...,P}.  Thus,  the  first  three  terms  inside  the  max  operator 
follow  from  Theorem  1.1.  The  other  two  terms  come  from  Lemma  1.1  and  inequality  1.1. 


1.3  Numerical  Algorithms  for  PDEs 

In  Sections  1.3.1  through  1.3.4  we  describe  the  class  of  linear  PDEs  whose  solutions 
we  will  be  approximating  and  introduce  some  useful  notation.  In  Section  1.3.5  we  give 
a  simple  example  of  such  a  PDE.  In  Section  1.3.6  we  describe  the  type  of  numerical 
approximations  we  are  analyzing.  We  finish  with  a  description  of  algorithmic  features  of 
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these  numerical  approximations.  Some  of  these  assumptions  are  simplistic,  but  the  results 
derived  in  Chapters  2  and  3  carry  over  immediately  for  more  realistic  assumptions.  This 
is  discussed  briefly  in  Chapters  4  and  5. 

1.3.1  Problem  Description 

Let  be  the  d  dimensional  Euclidean  vector  space.  Let  Q  be  a  compact  subset  of 
For  any  nonnegative  integer  k,  let  Ik  be  the  k  dimensional  unit  cube 

Ik  =  [0,1]  x  ...  x  [0,1]  C  . 

1)  We  assume  that  we  are  approximating  the  solution  of  a  linear  scalar  partial 
differential  equation  defined  on  fi  whose  solution  operator  can  be  represented 
by  an  expression  of  the  form 

'  r 

u(*)  =  X!  /  V,(z,x)gl{x)dx 
»=1 

for  any  z  G  fi.  u(z)  is  the  solution  function,  l  is  a  positive  integer ,  the  set 
{d,  |i  €  {1,. . . ,  /}}  is  made  up  of  nonnegative  integers,  and  the  set  of  functions 
{ g{  1 1  £  { 1 }  represent  the  problem  data. 

Thus,  the  solution  function  is  the  sum  of  /  components  of  the  form 

U,{z)  =  /  tyi(z,x)gi(x)dx  . 

Jr*, 

For  the  class  of  problems  being  considered  here,  the  kernels  {'I',}  are  linear  functionals  of 
the  Green’s  function  for  the  PDE.  See  Section  1.3.5,  Chapter  4,  and  Butkovskiy  [But82] 
for  examples  of  this  type  of  representation  of  the  solution  operator. 

1.3.2  Kernel  Assumptions 

We  assume  the  following  properties  about  the  functions  introduced  in  Section  1.3.1. 
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Notation 

For  a  given  k,  let  B(x\  6)  be  the  k  dimensional  open  ball  of  radius  <5  centered  on  x.  Define 
the  boundary  of  an  arbitrary  set  A'  in  3?*  to  be  the  set 

{x  I W  >  0  (  B(x]  6)  n  X  ±  0  and  £(x;  6)  n  -  X  ±  0  )  }  . 

That  is,  no  ball  centered  on  a  point  in  the  boundary  does  not  contain  points  in  both  X 
and  5Rfc  —  X .  Denote  the  boundary  by  dX .  Define  a  function  #(x)  to  be  a  type  1  function 
in  the  domain  Ik  if  it  satisfies  the  following  4  conditions. 

a)  'I'(x)  £  Lx{Ik),  i  e.  jJk  |'P(x)|dx  is  well  defined  and  finite. 

b)  The  boundary  of  the  subset  of  Ik  where  'H(x)  is  zero  has  measure  zero  in  9ifc. 

c)  'I'(x)  is  continuous  everywhere  in  Ik  except  on  a  set  that  has  measure  zero  in  3?*. 

d)  If  R  is  a  set  of  measure  zero  in  Ik.  then  JR  j^(x)|  dx  —  0  . 

Assumptions 

2)  For  each  i  €  {1,. . . and  z  £  0,  x)  is  a  type  1  function  in  Id,-  Further¬ 
more,  for  each  i  €  { 1 , 

max  /  x)|  dx 

refi  Jidi 

is  finite. 

1.3.3  Data  Assumptions 

We  assume  the  following  conditions  on  the  data  functions  {<7,}. 

Notation 

Let  Cm{Id,)  be  the  set  of  all  functions  that  have  continuous  mth  order  partial  derivatives 
on  the  set  Id,.  Let  Kd,(rn)  be  a  set  of  d,-vectors  whose  components  are  nonnegative 
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integers  that  sum  to  m,  i.e. 


Kd.(™)  = 


d, 


n  -  (px,...,pd%)  and  = 

i= i 


m 


Let  x  =  (x1? . . .  .x^,).  Let  xA  =  x?x?  •  •  •  x^1 .  Represent  a  particular  mth  order  partial 
derivative  of  a  function  g  6  Cm(3?^)  by12 


g(ii){x) 


d* 
dx * 


g{*) 


d ^ 

dx? 


d ^ 


9(x) 


y 


where  p  €  Kdt{m). 

Let  Vdi  be  the  gradient  operator  for  3?^'.  Thus, 


Define  Vj”>)  recursively  by 

where  ®  is  the  tensor  product  [Hal74],  If  g  £  Cm{Xd').  then  the  elements  of  Vl£]g(x) 
are  the  mth  order  partial  derivatives  of  g  at  x.  Let  be  a  vector  whose  elements  are 

some  ordering  of  the  elements  of  . 


Assumptions 

3)  For  each  i  €  {1, . . . ,  /},  we  assume  that  gi  is  known  to  be  some  member  of  a  set 
Gi  defined  in  the  following  way.  Gi  is  the  set  of  all  functions  g  satisfying  the 
properties 

i)  g(x)  €  Cm' (/<*,) 

ii)  ||Vi7')5(x)||(l)  <  Mi(x)  for  all  x  €  /* 

where  mi  is  a  positive  integer,  ||  •  ||(,)  is  a  vector  norm,  and  Mi(x)  is  a  bounded, 
nonnegative,  type  1  function. 

The  simplest  example  of  a  bound  on  the  norm  is  a  uniform  bound. 

12This  is  the  multiindex  notation  of  Schwartz.  See  John  [Joh82]. 
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Example:  A  uniform  bound  on  the  m;th  order  partial  derivatives  of  gt  has 
the  form 

l|V^Wx)||(j,  <  B  Vx€ld.  , 


where  the  norm  is  a  vector  norm  for  vectors  of  the  appropriate  length  and  B 
is  a  nonnegative  real  number.  For  example,  if  d,  =  2  and  m,  =  2,  then  a 
global  bound  based  on  the  discrete  infinity  norm  is 


r 

d 2 

max  < 

8 2 

3x^X2 


9i{x) 


<  B 


for  all  x  6  Id,- 


1.3.4  Compatibility  Conditions 

The  PDE  will  also  be  well-defined  for  data  functions  other  than  the  given  ones.  We 
assume  the  following  sufficient  conditions  on  permissible  data  functions  for  the  PDE. 

4)  For  each  i  €  we  assume  that  any  member  of  G,  is  a  permissible 

data  function  for  the  ith  component  of  a  solution  to  the  PDE.  We  also  assume 
that  any  combination  of  data  functions  from  the  sets  {G,}  generate  a  possible 
solution  to  the  PDE,  with  one  possible  exception.  The  inclusion  of  a  given  data 
function  gi  in  a  set  of  data  functions  may  force  the  data  function  for  a  different 
component,  g j,  to  have  given  function  and  derivative  values  on  the  boundary  of 
Idj .  We  will  refer  to  this  as  a  compatibility  condition. 

Each  set  of  compatible  data  functions  drawn  from  the  sets  {G,}  define  a  solution  function 
u.  We  will  refer  to  the  set  of  all  solution  functions  generated  in  this  fashion  by  U . 


1.3.5  Example 


Let 
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Consider  the  following  elliptic  PDE  in  two  space  dimensions, 


,x,u  =  f(x)  on  £((0,0);  1) 
u(x)  =  h(x)  on  <9£((0,0);1)  . 


Then 

u{z)=  J  G(z,x)f(x)  dx  +  J  (-0- G{z,x ) 

B({0,0);1)  3B((0,0);1)  ' 

where  dG(z,x)/dn  is  the  normal  derivative  of  G  with  respect  to  x  on  the  surface  of  the 
unit  ball.  The  Green's  function  G(z,  x)  has  a  logarithmic  singularity  at  x  —  z,  and  is  a 
C°°  function  elsewhere  in  £((0,0);  1).  See  Butkovskiy  [But82,  pages  131-132]  for  more 
details. 


^  h(x)  ds  , 


To  put  this  operator  into  the  desired  form,  we  first  define  the  function  G(z,x)  to 
be  identically  zero  for  x  €  ([-1,1]  x  [-1,1])  -  £((0,0);  1).  The  extension  of  the  data 
function  /  into  ([—1, 1]  x  [—1, 1])  —  £((0,0);  1)  can  be  done  arbitrarily.  As  will  become 
clear  in  the  next  chapter,  it  is  best  for  the  analysis  if  the  extension  is  as  smooth  as  possible. 
Note  that  /2  is  mapped  onto  the  region  [-1,1]  x  [-1,1]  by  the  function 


¥>i(*ii*2)  =  (2  •  xj  -  1,2  •  x2  -  1) 


3£((0,0);1)  can  be  divided  into  two  submanifolds, 


T2  =  {(xi,x2)  jxi  <  0,Xi  +  x2  =  0} 


and 

?3  =  {(Zl,*2)  |  Zi  >  0,Xj  +  X2  =  0}  . 

The  unit  interval  is  mapped  onto  T2  by  the  function 

¥>2(x)  =  (cos(^  +  x  •  x)) ,  -  sin(^  +  tt  •  x)) 
Similarly,  the  unit  interval  is  mapped  onto  T3  by  the  function 

<P3(*)  =  (cos(^  -  x  •  x)) ,  sin(^  -i r  •  x) 
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Therefore,  the  solution  operator  can  be  rewritten  in  the  form 

“ (*)  =  G(z,y?i(x))/(y?i(x)) -4dx  +  ^—G(z,tf>2{x))jh(<p2(x))-vdx 

+  Jo  (J^G{z,<p3(x))j  h(v3(x))  ■  ndx  . 

By  the  notation  of  Section  1.3.1, 

=  4  •  G(a,¥>i(x))  , 

^2(z,x)  =  i r  ■  (Jj-G(z,<p2(x))j  , 

<I/3{z,x)  =  ir-  ^-G(z,<p3(x))J  , 

9\(x)  =  f(i),  92{x)  =  hiViix)) ,  and  53(1)  =  ^(^(x))  . 

If  z  E  B(( 0,0);  1),  then  it  is  clear  that  each  is  a  type  1  function  in  its  domain.  Note 
that  £2  and  g3  must  satisfy  the  compatibility  conditions  that 

52(0)  =  $3(1) 

and 

£2(1)=?3(0)  . 

If  h  is  known  to  be  differentiable,  then  compatibility  conditions  will  also  exist  on  the  value 
of  the  derivatives  of  g2  and  g3  at  x  =  0  and  x  =  1. 

1.3.6  Finite  Approximations 

Numerical  approximations  to  the  solution  of  the  PDE  replace  the  possibly  infinite  dimen¬ 
sional  problem  with  a  finite  dimensional  problem.  The  dimensionality  of  the  problem  is 
reduced  by  introducing  error  in  the  following  sense: 

•  Only  a  finite  amount  of  information  about  the  solution  function  is  calculated.  Any 
model  of  the  solution  based  on  only  this  information  will  only  approximate  the  true 
solution. 
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•  Only  a  finite  amount  of  information  about  the  data  functions  is  used  to  calculate 
the  desired  solution  characteristics.  We  will  refer  to  this  as  the  data  function 
sampling.  Unless  this  information  completely  characterizes  the  data  functions,  the 
solution  values  that  are  calculated  are  also  approximate. 

The  PDE  is  replaced  by  a  relationship  between  the  chosen  information.  The  error  in  the 
approximation  to  the  solution  is  also  a  function  of  what  this  relationship  is. 

For  the  rest  of  this  dissertation,  we  restrict  ourselves  to  the  case  where 

5)  values  of  the  data  functions  at  given  locations  in  their  domains  are  used, 

6)  values  of  the  solution  function  at  given  locations  in  its  domain  are  approximated, 

7)  the  error  in  approximating  each  solution  value  is  bounded  by  some  given  value 

t. 

We  assume  that  an  algorithm  a  calculates  the  value  of  the  solution  function  u  at  some  set 
of  locations,  Z  =  { z}  |  j  =  1, . . . ,  Wo  u}.  And,  for  each  data  function  git  the  algorithm 
uses  function  values  at  some  set  of  locations  X,  =  {x,,jt  |  k  =  1, . . . ,  JV0it}.  For  any 
particular  solution  value  u(zj),  the  algorithm  will  use  values  of  <7,  at  some  set  of  locations 

xiJ  =  {xij,k\k  =  l,...,Na,i(zj)}  C  Xi 

in  Id%.  Note  the  slight  change  in  notation  from  Section  1.2.2.  Instead  of  Na^(u(Zj)),  we 
use  Na,i(Zj)- 

Intrinsic  errors 

Let  u  be  some  solution  function  in  U ,  and  let  {<7,}  be  the  corresponding  data  functions 
for  u.  Let  «,•(£,•)  =  f[d  'fi(zj,x)gi(x)  dx.  For  a  given  i,  let  GatlJ(g,)  be  the  set  of  all 
data  functions  in  G,  satisfying  the  following  properties. 

•  If  g  €  Ga,ij(gi).  then  g  has  the  same  values  as  <7,  at  the  locations  {x,Jifc}. 

•  If  g€  Ga.ij(gi)  and  /  €  {0,...,mt},  then 

d*g{x)  _  d^gjjx) 
dx*  "  dxfi 

for  all  x  e  did,  and  all  p.  e  I\d,{l). 
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That  is,  if  g  6  Ga$ij(gi),  then  g  and  g{  have  the  same  values  at  all  sampling  locations, 
and  g  has  the  same  function  and  derivative  values  as  g,  on  the  boundary  of  Id,.  Since 
compatiblity  conditions  only  constrain  functions  on  the  boundary  of  their  domain,  and 
since  gi  is  necessarily  compatible  with  the  other  data  functions,  each  function  in  Ga,l<:{gx) 
is  also  compatible  with  the  other  data  functions. 

If  <7t,i,  <7;,2  €  Gaiij(gi),  then  the  difference  between  the  solution  values  associated  with 
these  two  data  functions  is 

/  ^i{zj,x){giA(x)  -  gi<2{x))dx  . 

J  id, 

It  is  impossible  to  distinguish  <7,  from  the  other  functions  in  Ga,Xij(gx)  merely  given  the 
information  {<7i(xi, .,,*)}■  Call  the  uncertainty  introduced  by  the  data  function  sampling 
the  data  sampling  error.  Let  Ua,i,}{gi)  be  the  set  of  values  generated  by  these  functions, 

Ua,i,j{gi)  =  {v,\v,  =  [  Vi(zj,x)g{x)dx  for  g  E  Ga,ij(</,)}  .  (1.2) 

Jid, 

Let  Da,iy]{gi)  be  the  maximum  difference  between  these  values, 


Da,i,j{gt)  =  max  >i  -  v2\ 

v»,l  ,Vit2  (g,) 


=  max  /  ^i(zjxx){g,A{x)  ~  gi^{x))dx 

Define  the  minimum  worst  case  data  sampling  error  for  the  ith  component  at  z:  to  be 
the  tightest  possible  bound  on  the  data  sampling  error, 


min  max 

3t,l€Ga,tlJ(3i)  Si ,2  (fli) 


I  ^,(ij,x)(p,.i(x)  -  gi,2{x))dx 

JU, 


This  represents  the  amount  of  uncertainty  in  u(zj)  when  only  the  ith  component  is  being 
approximated. 

Lemma  1.5  The  minimum  worst  case  data  sampling  error  for  the  ith  component  at 
zx  is  qreater  than  or  equal  to 

2 
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Proof:  Assume  that  there  exists  a  .  6  Ga,., _,(<?,)  such  that 
max  /  x)(^,-,.(x)  -  p(i))  dx 

g€Ga,,j(g,)  [dt 


Oa.i.j  (9i) 


Then 


max 


/  ^<(2i,x)(5,,1(x)  -^,2(x))<ix 

I 

-  ™ax,  Al,  Vi(zj,x){g{x)  -  gt'.{x))dx 

P€^»a, i  j  ($ft  )  I*/ 

/  ^,(2j,x)(y,,.(x)  -  <7(x))c/x 

Ia 


+  max 
s€Ga,i,j(s.) 


Og.i,A9i)  ,  Oq-.tjidi) 

2  2 
=  Da<iJ(gi)  ■ 

This  is  a  contradiction.  Therefore, 

max  /  #<(*,•,  *)($,>(*) -5(x))dx 

£€^a,«,./ (.ff* )  •'/rf, 

for  all  gi<m  €  and 

min  max  /  $,(2,,  x)(^,i2(x)  -  gl<2{x))  dx 

9*,1  €Go,»j  (j» )  9i,2€Ga,ij(9*) 


> 


> 


O  a.i.jjgi) 


We  can  similarly  define  a  minimum  worst  case  data  sampling  error  for  u  at  z:.  Let 
G'aij(gi)  be  the  set  of  all  data  functions  in  G{  that  have  the  same  values  as  gt  at  the 
locations  {x,^*}.  Let  U'a  i](gi )  be  the  set  of  ith  component  values  generated  by  these 
functions.  Let  Ua,Au)  be  the  set  of  all  solution  values  whose  data  functions  have  the 
same  value  as  those  for  u  at  the  sampling  locations, 


Ua,j{u)  =  <  V  V  =  Y^Vi,  Vi  6  U'a,i,j(gi)  f  fl U  • 


(1.3) 


i=0 


Let  Da^(u)  be  the  maximum  difference  between  these  values, 


DaJ{u)  =  max  |i>i  -  v2\ 

v\  ,V2€t/o,;(w) 
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Then  the  minimum  worst  case  data  sampling  error  for  u  at  z:  is 


min 


max 

U2€Ga,j(u) 


|vi  -  u2|  . 


Lemma  1.6  The  minimum  worst  case  data  sampling  error  for  computing  u(zj)  is 
greater  than  or  equal  to 


I 


E 


2 


Proof:  For  each  i,  Ga,t,j(gi)  C  G'ai<J(gi).  Moreover,  any  combination  of  data  functions 
from  the  sets  {(?<», is  compatible,  since  each  function  is  compatible  with  the  true 
data  functions.  Therefore,  if  a  function  v  has  the  form 

i 

v  =  J2v'  vi  €  tw*)  , 

«=o 

then  v(zj)  6  Uaj(u).  In  consequence, 


D*M  =  max  |vi-v2| 

VI  ,V2€UaiJ(  11 ) 

l 

=  ]C  ,  .  Kl  -  Vi>2\ 


>  E*W*>  • 


1=0 


The  same  argument  used  in  Lemma  1.5  proves  that 


I  t  ^  Da,j{u) 

mm  max  hi  —  vol  >  — - - 

v\^Uaj(u)  V2^Ua,,(u)  2 


Therefore, 


min  max  |ui  —  u2|  >  ^ 

i=0 


V2 £G«,,(u) 


^  7i) 

2 


Since  these  minimum  worst  case  sampling  errors  are  only  functions  of  the  sampling 
locations  and  the  PDE,  we  call  them  intrinsic  errors.  For  arbitrary  Gt,  these  tower 
bounds  on  the  intrinsic  errors  can  be  arbitrarily  large.  For  a  priori  estimates  of  the  error, 
some  assumptions  about  the  data  functions  are  necessary.  Assumption  3  on  page  19  is  a 
common  type  of  assumption  for  piecewise  polynomial  approximation  based  methods,  like 
finite  difference  [RM67]  [BL84]  and  finite  element  [SF73]  methods. 
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Modelling  errors 


Most  numerical  methods  introduce  more  than  the  minimum  worst  case  error  when  ap¬ 
proximating  the  solution  values.  Tighter  bounds  can  be  derived  by  examining  what  model 
a  particular  method  uses  to  represent  the  data  functions  from  the  given  values  For  ex¬ 
ample,  standard  Galerkin  methods  [Fle84]  [SF73]  [BTW84]  explicitly  introduce  a  model 
for  the  solution  and  each  data  function,  u  and  {^}.  Even  when  the  solution  is  the  only 
function  being  explicitly  modelled  in  the  algorithm,  the  error  introduced  by  the  model  can 
often  be  interpreted  as  error  imposed  upon  the  data. 

Each  model  of  a  data  function  introduces  error  into  the  approximate  solution  in  a 
fashion  similar  to  the  intrinsic  error.  For  simplicity,  assume  that  exactly  one  data  function 
model  is  used  to  approximate  gi  in  the  calculation  of  a  given  solution  value  u(z:).  Call 
this  model  Then  the  minimum  worst  case  error  in  computing  the  itb  component  at 
Zj  is 

max,  Jr  *i(2j,x){g(x)-gij{x))dx  . 

9eGaii,j(s,)  Jidt 

To  bound  the  modelling  error  for  the  data  functions  satisfying  the  assumption  on 
page  19,  any  model  must  satisfy 


[  Vi(zj,x)(gi,j-g)dx  =  0 


for  all  polynomials  g  of  order  less  than  m,.  This  is  a  simple  variant  of  Theorem  3.1 
in  [TW80,  page  31].  It  is  sufficient  if  the  model  jjij  is  an  m,th  order  accurate  approxima¬ 
tion  to  the  data,  i.e.  all  polynomials  of  degree  less  than  rm  are  represented  exactly. 


1.3.7  Linear  Algorithms  and  Serial  Complexity 


For  an  algorithm  a,  let  gt  be  a  vector  made  up  of  the  values  of  g ,  used  by  the  algorithm. 
If  a  is  linear,  then  the  finite  dimensional  problem  can  be  expressed  by  a  matrix  equation 
of  the  form 

F 

Au  =  YlR<9i  ,  (1.4) 

i=0 

where  u  is  a  vector  containing  the  solution  values  to  be  approximated.  If  all  elements  of 
u  are  nontrivial  functions  of  the  data,  then  A  must  be  expressible  as  a  square  nonsingular 
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matrix.  If  A  is  an  N  x  N  matrix  and  is  an  N  x  M,  matrix,  then  N  >  Na<u  and 
=  Naj.  The  serial  complexity  of  solving  this  matrix  equation  is  a  function  of  the 
properties  of  the  matrices,  but 

/(/)  •  N  +  /(-)  •  Yi N**  +/(+)•  max  (£Na,i(*j)  ~  l]  U-5) 

1=0  ;  \t=0  J 

is  a  lower  bound.  The  first  two  terms  follow  from  the  representation  of  the  matrix  problem, 
and  the  last  term  follows  from  Lemma  1.2  on  page  15.  Here  /(,),  /(/),  and  /(+)  are  the 
times  to  execute  a  floating  point  multiplication,  division,  and  addition  respectively.  Note 
that  the  second  term  is  a  lower  bound  on  forming  the  right  hand  side,  and  the  first 
term  is  a  lower  bound  on  solving  the  matrix  equation.  This  expression  assumes  that  the 
coefficients  in  Ri  and  A  that  are  equal  to  1  are  either  not  known,  or  not  taken  advantage 
of. 

Explicit  system 

If  A  is  a  diagonal  matrix,  then  this  is  an  explicit  method.  N  =  Na<u  for  an  efficient  serial 
algorithm.  If  we  assume  that  u  is  ordered  so  that  u:  is  the  approximation  to  u(zj),  then 
Na<t(zj)  is  the  number  of  nonzero  entries  in  the  jth  row  of  /?,.  Each  multiplication  of  a 
row  of  R,  by  g,  requires  Na<i(zj)  multiplications  and  N0ti(zj)  —  1  additions.  Therefore, 
the  complexity  of  the  straightforward  method  for  calculating  the  solution  of  the  matrix 
equation  is 

Nu  f  J  \ 

El/(/)  +  E(/(-)  ''v...(y  +  /w(JV«(i>)-1))J  ■  o-6> 

Again,  the  possibility  of  nonzero  entries  being  known  to  have  the  value  of  1  is  ignored.  If 
the  Rx  matrices  have  special  properties,  then  this  serial  complexity  can  be  reduced.  The 
classic  example  is  the  fast  fourier  transform  method,  which  reduces  the  se^al  complexity  of 
a  matrix-vector  multiplication  fora  particular  dense  matrix  from  0(A'2)  multiplications  and 
additions13  to  0(iVlogiV)  multiplications  and  additions.  See  [AHU74]  for  a  description 

13For  two  real  valued  functions  /  and  g,  f(x)  =  0(^(x))  if  there  exist  positive  constants  ci,  c 2, 
and  xo  such  that 

ci ls(x)l  <  l/(*)l  <  c2|?(r)| 

for  all  x  >  xo  [HS78] . 
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of  this  method. 

Implicit  system 

If  A  has  more  than  one  nonzero  element  in  at  least  one  row,  then  this  is  an  implicit 
method.  Na^{z3)  may  no  longer  simply  be  the  number  of  nonzero  entries  in  the  yth  row 
of  /?,.  For  many  PDEs,  the  best  known  serial  algorithms  are  implicit.  Some  of  the  work 
in  good  serial  algorithms  for  solving  the  matrix  equation  is  common  to  the  calculation  of 
multiple  solution  values.  The  serial  complexity  of  calculating  a  single  solution  value  will 
generally  be  larger  for  an  implicit  method  than  for  a  good  explicit  method,  but  the  shared 
work  when  calculating  all  of  the  solution  values  can  make  it  the  better  algorithm. 

For  an  implicit  method,  the  serial  complexity  is  a  function  of  both  evaluating  the  right 
hand  side  and  solving  an  equation  of  the  form  Ax  =  b.  A  general  matrix  equation  solver 
has  a  serial  complexity  of  0(iV3)  multiplications  and  additions.  But  matrix  equations 
generated  by  piecewise  polynomial  models  for  elliptic  PDEs  can  be  solved  using  multigrid 
algorithms  in  Q(N)  multiplications  and  additions.  Both  methods  have  O(N)  floating 
point  divisions.  Other  solvers  for  the  matrix  equation  have  complexities  in  between  these 
two  extremes,  and  all  depend  on  assumptions  about  the  properties  of  A.  We  will  use  the 
expression 

c-  N1 

to  represent  a  bound  on  the  number  of  multiplications  and  additions  for  some  as  yet 
unspecified  matrix  equation  solver.  We  will  always  assume  that  7  6  [1-3]. 

Other  costs 

This  description  of  the  serial  complexity  has  ignored  the  cost  of  computing  the  elements 
of  the  matrices  A  and  {/?,}.  This  can  be  considerable,  although  it  is  usually  a  linear 
function  of  the  number  of  nonzero  elements.  Since  it  can  be  computed  a  priori,  we  will 
continue  to  ignore  it.  It  will  not  affect  the  lower  bounds  described  in  the  later  chapters. 
But  it  can  influence  the  comparison  of  different  methods. 
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Chapter  2 

Information  Requirements 


Let  Z  be  a  finite  set  of  locations  in  fl.  For  a  PDE  satisfying  assumptions  1-4  in  Section  1.3, 
define  A  to  be  the  class  of  serial  algorithms  that  approximate  the  solution  of  the  PDE 
and  satisfy  the  following  condition: 

•  Each  algorithm  is  based  on  a  finite  approximation  satisfying  assumptions  5-7  on 
page  22. 

Let  A((Z )  be  the  subset  of  A  satisfying  the  following  additional  condition: 

•  Z  is  a  subset  of  the  locations  where  the  solution  function  is  approximated.  And,  for 
all  u  €  U,  the  error  in  approximating  u(z)  at  each  location  z  €  Z  is  bounded  by  c. 

As  in  Section  1.3.6,  let  Na^{z)  be  the  number  of  values  of  <7,  that  are  used  by  an  algorithm 
a  when  approximating  ti(z).  Let  JVt  ,(i)  be  the  minimum  number  of  values  of  <7,  required 
by  algorithms  in  At({z}), 

NCti{z)=  min  Na,i(z)  . 

a£At({z}) 

That  is,  all  algorithms  in  A  that  approximate  u(z )  and  satisfy  an  error  tolerance  of  e  must 
use  at  least  Ntt,(z)  values  of  <?,. 

Lemma  2.1  Let  /(+)  be  a  lower  bound  on  the  execution  time  of  a  floating  point 
addition  in  a  given  multiprocessor.  Then  the  expression 

■  max 

z£Z 
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is  a  lower  bound  on  the  parallel  complexity  of  any  parallel  implemetation  of  an  algo¬ 
rithm  a  £  Ae(Z )  on  that  multiprocessor. 


Proof:  If  z  £  Z,  then  At(Z)  C  At({z}).  Therefore,  Naj(z)  >  N(i,(z)  for  all  z  £  Z  when 
a  £  Ae(Z).  From  Lemma  1.4  on  page  15,  a  lower  bound  on  the  parallel  complexity  of  a 
parallel  algorithm  a  is 


/(+) 


max 

J€{1 


log2 


i  i=0 


Therefore,  a  lower  bound  on  the  parallel  complexity  of  any  parallel  implementation  of  any 
algorithm  a  £  At(Z)  is 


/(+) '  max 


.1=0 


I 

In  this  chapter  we  calculate  upper  and  lower  bounds  on  JVe,(z)  of  the  form 


when  AT£tj(f)  ^  0  and  A/,(x)  >  0  for  all  x  G  Id,-  We  then  define  optimal  parallel 
algorithms  for  the  set  A^Z),  and  use  the  bounds  on  Ar£,,(z)  to  prove  the  following  lower 
bound  on  their  parallel  complexity.  If  NCti(z)  >  C  ■  for  even  one  z  £  Z  and  one 

i  £  {0,...,/},  then  the  parallel  complexity  of  optimal  parallel  algorithms  must  increase 
linearly  as  a  function  of  d{  ■  log2  e-1  when  the  error  tolerance  goes  to  zero.  We  also  show 
that  there  is  always  a  parallel  algorithm  based  on  an  explicit  linear  algorithm  whose  parallel 
complexity  is  within  a  constant  factor  of  this  lower  bound. 


2.1  Lower  Bound 

Assume  that  a  £  A.  Let  u  £  U  be  a  solution  function,  and  let  {(?,}  be  the  corresponding 
set  of  data  functions.  As  in  Section  1.3.6,  define  Ga,t,j(p.)  to  be  the  set  of  all  data 
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functions  in  (5,  that  have  the  same  values  as  <7,  at  the  locations  {£«.>,*},  and  have  the 
same  function  and  derivative  values  as  5,  on  the  boundary  of  Id, ■  Define  Da<it:(gt)  by 


Daiij{gi)  =  max 

Si,  1  .Si, 3  feL»a,i,j  (3, ) 


/  *)(s.,i(z)  “  9iAx))  dx 

J  l A 


Let  0  be  the  function  that  is  identically  zero  on  Id,-  By  our  assumptions  about  the 
data  functions,  0  €  G Let  A°i({zj})  be  the  subset  of  A  that  satisfies  the  following 
condition.  If  a  €  A°({zj}),  then  Z?a^j(0)  <  2  •  e.  Define  N°t(zj)  by 


Lemma  2.2  For  all  i  €  {0, . . . ,  /}, 

>  K&)  • 


Proof:  Let  i'  be  some  element  of  {0, Assume  that  a  £  A,  but  that  a  £  4°, •<({£,}). 
Therefore,  Dayj( 0)  >  2  •  e. 

By  the  definition  of  U  and  the  assumptions  on  G,<,  there  exists  some  solution  function 
u  €  U  for  which  g?  =  0.  By  Lemma  1.6,  the  minimum  worst  case  data  sampling  error  in 
approximating  this  solution  function  at  z,  is  bounded  from  below  by 


Da,'A0) 


+ 


Pg,i,j{9i) 

.«o  2 


(2.1) 


Since  Daii>j(gi)  >  0  for  all  i,  this  bound  is  itself  bounded  from  below  by  DayA 0)/2. 
Therefore,  the  lower  bound  on  the  error  in  expression  2.1  is  strictly  greater  than  e  for  this 
solution  function,  and  a  £  At({zj}).  Since  a  was  an  arbitrary  algorithm  not  in  ^^({z^}), 
this  implies  that  At({zj})  C  A°it({zj}).  It  follows  immediately  that 


Nc,,(*i)  >  ' 


Since  i'  was  arbitrary,  this  proves  the  Lemma.  I 

In  the  rest  of  this  section,  we  calculate  lower  bounds  on  the  number  of  sampling 
locations  necessary  to  satisfy 

Dait,}( 0)  <  2-e  . 

The  major  tool  in  this  analysis  is  the  following  lemma. 
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Lemma  2.3 


2  •  max 

'r<6Ga,,,j(0) 


Proof:  If  G  Ga,,j(0),  then 


=  |^r’i'U*)|(0  <  Mm 

for  all  x  e  Id,-  It  is  clear  that  (—7,..)  and  its  derivatives  are  zero  at  the  same  locations 
that  7 it.  and  its  derivatives  are  zero.  Therefore,  (-7 «>)  G  Ga,i,j(0)  and 


max 

Ti.i  (0) 


/  «,-(* j,  x)(7i,i(^)  ~  li,2(x))  dx 

JI*i 


-  Bax  m  lr  *)  (7.>(*)  -  (-7U*))) 

=  2  ■  2P*  I [  ^ 


Also, 


max 

■>•,1  ,'ri,2€Ga,,,J(0) 


/  ®i(*j.*)(7i,i(x)  -  7<.2(*))  dx 

Ju, 

-  ^ax  ^  *)7i,i(x)  dx 

7i,i€C?OtliJ(0)  |y/d< 

=  2  ■  r?.ax 


4-  max 

-r.,2eG0 


X...  /  *<(f;»*)(-7i,2(*))rfx 

•'id, 


1 

We  will  calculate  a  lower  bound  on  Do,»,j(0)  for  a  given  set  of  data  sampling  locations 
by  constructing  a  C°°  function  T  G  GOilJ(0).  We  will  show  that  this  lower  bound  on  the 
intrinsic  error  is  a  function  of  m,  and  the  distance  between  the  sampling  locations.  From 
this,  we  calculate  a  bound  on  the  number  of  sampling  locations  we  must  use  to  satisfy  a 
given  error  bound.  To  quantify  the  measure  of  distance,  we  introduce  the  Voronoi  diagram 
of  a  set  of  sample  locations. 
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2.1.1  Voronoi  Diagram 

We  briefly  describe  Voronoi  diagrams,  and  mention  some  of  their  properties.  See  also 
(GS85J. 

Discrete  sampling  locations 

Consider  a  set  of  n  distinct  sampling  locations  {x*}  in  dtd.  The  Voronoi  diagram  as¬ 
sociated  with  these  locations  is  a  partition  of  made  up  of  n  open  sets  { V(x *)}  and 
one  closed  set,  B.  A  point  is  in  V(x/)  if  it  is  closer  to  x*  than  it  is  to  any  other  sample 
location,  i.e. 

v(*i)  =  {y  I  Pi  -  y|U  <  \\xk  -  j/lh ,  vfc  ^  /}  . 

||  •  ||2  is  the  Euclidean  norm  for  $ld.  We  will  call  V(x/)  a  Voronoi  cell,  and  refer  to  ij 
as  the  center  of  the  cell.  B  is  the  complement  of  the  union  of  the  Voronoi  cells.  Every 
point  on  B  satisfies  the  condition  that  the  nearest  sampling  location  is  not  unique.  That 
is,  each  point  is  equidistant  from  at  least  two  of  the  sample  locations. 

The  Voronoi  diagram  can  be  constructed  in  the  following  fashion.  Consider  two  loca¬ 
tions  in  9td,  Xj  and  x2.  Partition  this  space  into  three  subregions,  the  (d—  1)  dimensional 
hyperplane  all  of  whose  points  are  equidistant  from  both  xi  and  x2,  and  the  two  open  half 
spaces  defined  by  this  hyperplane.  Let  H i,2(xi)  and  i/i, 2(x2)  be  the  half  spaces  contain¬ 
ing  §!  and  x2  respectively.  For  a  set  of  n  locations  {£*},  consider  a  particular  location 
x/  €  {ifc}.  Identify  the  half  spaces  and  bisecting  hyperplanes  constructed  by  a  pairwise 
analysis  of  i|  and  each  of  the  other  n  —  1  locations.  Then  V(xi)  is  the  intersection  of  all 
of  those  halfspaces  that  contain  that  location, 

V(x,)  =  p|  Hltk(x,)  . 
k*l 

Thus,  V(xi)  is  a  convex  polygon.  All  other  Voronoi  cells  are  constructed  in  an  analogous 
way.  See  Figure  2.1  for  an  example  Voronoi  diagram  in  3?2. 

Smooth  surfaces 

We  can  also  extend  the  concept  of  a  Voronoi  diagram  to  include  a  smooth  surface.  For 
example,  let  S  be  a  (d—  1)  dimensional  difFerentiable  surface  in  3?^,  and  let  IF  be  a  set  of 
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Figure  2.1:  Example  Voronoi  diagram  in  ft2. 

sampling  locations  bounded  away  from  the  surface.  Treat  every  point  on  S  as  a  sampling 
location.  The  Voronoi  cell  associated  with  a  location  *'  €  S  is  again  the  set  of  points  in 
ftd  that  are  closer  to  x'  than  to  any  other  sampling  location, 

V{x')  =  {y  |  ||*'  -  y||2  <  ||*  -  y||2  ,  V*  £  S  U  W  ,  *  /  *'}  . 

If  *'  is  in  the  interior  of  S  then  V(*')  is  a  segment  of  the  line  that  is  normal  to  5  at 
*'.  The  segment  is  only  bounded  if  there  is  at  least  one  sampling  location  in  each  half 
space  defined  by  the  plane  tangent  to  S  at  *'.  V(x’)  is  not  an  open  set  in  ftd,  but  it 
is  isomorphic  to  an  open  set  in  ft1.  And  B  is  still  a  closed  set.  The  Voronoi  cells  for 
locations  off  the  surface  may  no  longer  be  polygons,  but  they  are  still  convex.  Note  that 
this  Voronoi  diagram  is  a  limiting  case  of  the  previous  construction. 

Sampling  in  a  bounded  region 

For  a  final  generalization,  consider  a  finite  discrete  sampling  {**}  within  a  compact  region 
C  C  ftd  whose  boundary  dC  is  the  union  of  a  finite  number  of  smooth  (d- 1)  dimensional 
surfaces.  Define  the  Voronoi  diagram  for  these  sample  locations  by  first  assuming  that 
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every  location  on  the  boundary  is  also  a  sampling  location,  and  then  taking  the  intersection 
of  the  resulting  Voronoi  diagram  with  C .  The  Voronoi  cells  for  the  interior  locations  are 
naturally  contained  in  C,  so  only  the  Voronoi  cells  with  centers  on  the  boundary  are 
altered  by  being  restricted  to  C. 

Definitions  and  notation 

Consider  a  finite  set  of  sampling  locations  {x*}  in  a  compact  region  C  whose  boundary  is 
the  union  of  a  finite  number  of  smooth  (d— 1)  dimensional  surfaces.  Let  W  =  {xfc}U<9C. 
Let  V(w),  w  G  W,  be  the  closure  of  V(w)  in  The  neighbors  of  a  cell  V(w')  are 
those  cells  whose  closures  intersect  with  the  closure  of  V{w'),  i.e. 

{  V ( tv )  |  V (w1)  n  V(w)  ±  0  ,  w  €  W ,  w  ^  w}  . 

The  neighbors  of  a  sampling  location  w'  are  those  sampling  locations  w  G  W  for  which 
V(w)  is  a  neighbor  of  V(u>'). 

Let  the  boundary  of  a  cell  V(w'),  dV(w’),  be  the  intersection  of  B  with  the  closure  of 
V(w').  Every  point  on  the  boundary  of  one  cell  is  also  on  the  boundary  of  a  neighboring 
cell.  Thus,  every  location  in  the  boundary  of  V(w')  is  equidistant  from  u>'  and  the  center 
of  one  of  its  neighbors,  and  is  no  closer  to  any  other  center.  Let  r(w')  be  the  maximum 
distance  between  w'  and  the  points  on  the  boundary  of  V(u>'),  i.e. 

r(u>')=  max  Jlur'-yllj  . 

»e3V(u/) 

We  will  use  this  function  as  a  measure  of  the  distance  between  sample  locations. 

2.1.2  Error  Function 

In  this  section  we  construct  a  function  T  G  Co,i,j(0)  using  the  Voronoi  diagram  corre¬ 
sponding  to  a  set  of  data  sampling  locations. 

Error  function  in  Idt 

Let  B{x\8)  be  the  closure  of  B(x;8),  the  d{  dimensional  ball  centered  on  x  with  radius 
8.  Choose  a  location  x.  in  the  interior  of  Id<  and  a  8  >  0  such  that  B(x.;8)  C  Id,  and 
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^t(zvx)  is  continuous  and  nonzero  on  B(x.;8).  Thus,  since  B(x.\8)  is  a  compact  set, 
|'Pi(£j,x)|  has  a  nonzero  minima  in  B(x.,8). 

If  such  a  ball  does  not  exist,  then  Ui(zj)  =  0  and  Nc,i(zj)  =  0.  This  follows  from 
'Pi  being  a  type  1  kernel.  The  integral  over  the  set  of  measure  zero  where  'P,  is  singular 
or  discontinuous  has  no  contribution,  and  'P,  is  zero  elsewhere.  Thus,  any  error  can  be 
tolerated  in  the  data  function.  For  now,  assume  that  Nt,i(zj)  ^  0. 


Lemma  2.4  If  di  >  0  and  no  sampling  locations  are  located  in  B(x.\8),  then 


£><,,  , (0)  >  2  •  Ci  •  min  |'P,(i,-, x)|  •  min  A/,(x)  •  8 

xC.B(x.,S)  xeS(r.,5) 

where  Ci  is  a  positive  constant  dependent  only  on  mi,  </,,  and  | 


mi+di 


To  prove  this,  we  need  the  following  fact. 


Lemma  2.5  Assume  that  d,  >  0.  For  any  x,  6  and  8  >  0  such  that  B(x.;6)  C 
Id, ,  there  exists  a  function  7  €  Cco(Id, )  with  the  following  properties. 


1)  7  e  Gi. 

2)  7(1)  =  0  for  all  x  ^  B{x,,8). 

3)  7(i)  >  0  for  all  x  €  B(x»\8). 

4)  There  exists  a  constant  Ci  >  0  depending  only  on  mi,  di,  and  ||  •  ||(j)  such  that 

[  7 (x)  dx  >  Ci  ■  8m,+d'  ■  min  Mi(x)  . 

J  Id,  x^B(x.\S) 


The  proof  of  Lemma  2.5  is  a  straightforward,  but  somewhat  lengthy,  construction.  See 
Appendix  A  for  the  details.  The  proof  of  Lemma  2.4  follows  directly  from  Lemma  2.5. 

Proof  of  Lemma  2.4:  Let  7f,  be  the  function  satisfying  properties  1-4  of  Lemma  2.5  on 
the  ball  B(xm;  8).  Since  there  are  no  sampling  locations  in  B(x. ;  6),  and  since  7 j.(x)  =  0 
for  x  £(x;<5),  -y£t  is  also  a  member  of  GajI>j(0).  Since  7i.(x)  is  nonnegative  and 
vp,(xj,x)  is  of  one  sign  in  B(x.;8), 

I  *.(x)dx  =  ±/  |'P,(fJ,x)|7i.(x)dx  . 

J B(x.  ;6)  J B(x.  ;6) 
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Therefore,  by  Lemma  2.3, 


Amj(0)  > 

2  • 

max 

'y€Ga.,J(0) 

^  'Hi(zj,x)~f(x)dx 

> 

2 

JB(s.\S ) 

> 

2 

■_  L.  c,7**(x)dx 

x€B(x*\6)  J B(xm  ;5) 

>  2  •  Ci  •  min  I  $\(x,-,x)|  •  min  Mdx)  ■  Sd,+!n'  . 

This  proves  the  Lemma.  I 

To  decrease  the  error  below  the  bound  in  Lemma  2.4,  the  data  function  must  be 
sampled  in  B(x.\6).  Assume  that  B(x.;6)  contains  n  data  samples  {xjt}.  As  described 
in  Section  2.1.1,  these  locations  and  the  boundary  of  the  ball  define  a  Voronoi  diagram 
in  B(xm;S).  Let  W  =  {xfc}.  Construct  a  function  T  €  Gaiij(0)  in  the  following  way: 

a)  For  each  Voronoi  cell  V(w),  w  6  W,  identify  a  location  y&  on  its  boundary  where 
the  function 

||y-u)||2  for  yedV(w) 

reaches  its  maximum,  i.e. 

~  u>||2  =  r(w )  . 

b)  Let  W  ~W ,  W'  =  0,  and  rold  =  0. 

c)  Choose  an  element  w1  G  W  that  maximizes  r(w)  over  the  set  W .  Since  y &  G 
dV(w'),  it  is  no  closer  than  r(u>')  to  any  of  the  other  locations  in  W .  Therefore 
B(y'x-,r(w'))  contains  no  sample  locations.  Let  'v  be  the  function  described  in 
Lemma  2.5  for  the  ball  £(£,*;  r(xD'))-  Add  w'  to  W' ,  and  set  Pnew  =  rold  4-  7^. 

d)  Let  S(w')  be  the  subset  W  such  that  —  tZ>||  <  3r(u)')  if  w  €  S(w').  Remove 
the  locations  in  S(w')  from  W.  Set  roid  equal  to  rnew. 

e)  If  W  is  the  empty  set,  then  stop.  Otherwise,  return  to  step  c. 
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Upon  termination,  W  =  0  and 


r(z)  =  Y  7w-( x )  •  (2.2) 

w'ew 

Let  n'  be  the  number  of  elements  in  W.  The  following  three  lemmas  verify  that  the 
construction  is  well-defined,  and  that  T  has  the  desired  properties. 

Lemma  2.6  The  construction  of  T  is  a  well-defined  process. 

Proof:  For  step  a,  we  must  be  able  to  find  a  ya  for  any  Voronoi  cell  V(w).  Since  the 
function  ||y  —  u>||2  is  continuous  for  y  €  and  the  boundaries  of  these  Voronoi 

cells  are  closed  bounded  sets,  there  exists  some  y &  such  that  lly*  —  u>||2  =  r( w)  for  each 
w  6  W. 

For  step  c,  we  must  be  able  to  find  a  w'  6  W  that  maximizes  r(w )  over  this  set. 
There  are  never  more  than  n  sampling  locations  in  W,  so  the  cell  that  maximizes  r'w) 
in  this  set  is  well-defined. 

Finally,  step  d  always  removes  at  least  one  element  from  W .  Since  it  begins  with  n 
elements,  the  construction  terminates  after  no  more  than  n  additions  to  T.  I 

Lemma  2.7  The  function  T  is  the  sum,  of  a  finite  number  of  nonnegative  C°°(Id,) 
functions  {7*}  each  of  which  satisfies  the  following  three  conditions. 

1)  The  support  of  7*  is  wholly  contained  in  B(xm;6). 

2)  7 k  €  Gajj(0). 

3)  The  support  of  7*  is  disjoint  from  the  support  of  the  other  functions  in  the  set. 

Proof:  T  is  the  sum  of  the  n'  functions  {7^  |  w1  e  W'}.  By  the  argument  in  Lemma  2.6, 
n'  <  n,  and  so  is  finite.  From  Lemma  2.5,  we  know  that  each  7^  is  a  nonnegative 
C°°(Idi)  function,  and  a  member  of  G,. 

We  also  know  that  7^(1)  =  0  if  x  £  r(u>'))  for  each  w'  £  W'.  Since 

y&  is  no  closer  than  r^tu')  to  any  other  Voronoi  cell  center,  and  since  all  locations  in 
dB{x.\  6)  are  centers  of  Voronoi  cells,  y#  is  no  closer  than  r(w')  to  dB{x.\ 6).  Therefore 
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B(y&\ r(u>'))  C  B(xm;8),  and  the  support  of  each  7^.  is  restricted  to  Biy^r,  r(u?')).  This 
proves  the  preamble  and  statement  1. 

From  step  c  of  the  construction,  we  know  that  B(y&;r(u>'))  contains  no  sample 
locations  that  are  strictly  inside  B(x,\8).  And,  from  the  previous  paragraph,  we  know 
that  B(y&\r(w'))  contains  no  sample  locations  that  are  outside  of  B{x.\8).  Therefore, 
7ux  is  zero  at  all  sampling  locations.  Also  from  the  previous  paragraph,  we  know  that  each 
7^  is  identically  zero  outside  of  B(xm\  6).  Since  7^  is  a  C°°  function,  all  derivatives  of  7^ 
are  zero  outside  of  B(xm;8),  and  7^  satisfies  any  compatibility  conditions  associated  with 
the  data  function  0.  Therefore  7 &  £  Ga.,j(0)  for  all  tv'  £  W' .  This  proves  statement  2. 

Assume  that  y  £  I 4.  is  in  the  support  of  both  7^'  and  7^  for  w':,w'k  £  W' .  Since 
the  support  of  each  7^  is  restricted  to  B(y^ ;  r(u>')), 

lly*;  -  y*'„ Ih  <  r(u>')  +  r(u)fc)  . 

Therefore 

IK  -  J/*;||2  <  2r(u>')  +  r(tZ>i) 

and 

K  -  H2  <  r(u>')  +  2r(w'k)  . 

Without  loss  of  generality,  assume  that  7^  was  added  to  T  first.  Then  r(w'k )  <  r(tZ>'), 
and  ||u)*  —  y^ ||2  <  3r(ti)').  By  step  d  of  the  construction,  w'k  would  have  been  removed 
from  W  immediately  after  7^  was  added,  and  thus  would  never  have  been  used.  This 
leads  to  a  contradiction.  Therefore  we  know  that 

II yw't  -  y*'kh  >  rK)  +  r{w'k)  , 

and  that  Bfy^S)  fl  Biy^',8)  =  0  for  all  iD',  wk  £  W' .  This  proves  the  last  statement 
of  the  Lemma.  I 

Lemma  2.8  T  is  a  nonnegative  C°°{Idx)  function  that  is  a  member  of  Ga,ij{ 0),  and 
whose  support  is  wholly  contained  in  B(x*;8). 

Proof:  By  statements  1  and  2  of  Lemma  2.7,  T  is  the  sum  of  a  finite  number  of  nonneg¬ 
ative  C°°  functions  that  are  elements  of  Go>1J(0),  and  that  are  identically  zero  outside 
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of  B(xm\6).  Therefore  T  is  also  a  nonnegative  C°°  function,  T(x)  =  0  if  x  £  {x*},  and 
T(x)  =  0  if  x  B(xm-,6).  All  that  is  left  to  prove  is  that 

l|v't""lr(i)||(j)  <  Mj(x) 


for  all  x  £  Id,- 

From  equation  2.2, 


§prW=  E 


w'ew 


dxa 


for  any  multiindex  s.  By  statement  3  of  Lemma  2.7,  the  support  of  the  functions  {7^} 
are  pairwise  disjoint,  and 

—T(x)  =  [  )  ’  for  i  6  B(y^;r(w')) ,  w'  £  W'; 

dx M  '  - 


10, 


otherwise, 


for  p  £  I<d,{rnx).  Therefore, 

(  ,  for  x  €  B(y*>\ r(u>')) ,  w'  £  W'\ 
otherwise. 

Since  7*'  €  G,  for  all  w'  £  W',  || V^,)r(x)||(t)  <  A/,(x)  for  all  x  £  Id,-  I 

The  next  lemma  implies  that  the  support  of  T  covers  a  significant  portion  of  B(x,;6). 
The  proof  is  in  AppendixB. 

Lemma  2.9 


v^nxil! 


r7(m* ) 


7  w'(x) 


B(x.;6)C  1J  B(y&;5r(w'))  . 


The  major  result  concerning  this  construction  is  the  following  theorem.  It  describes  how 
the  lower  bound  we  have  constructed  can  vary  as  the  sample  locations  vary. 


Theorem  2.1  Assume  that  N^i(zj)  ^  0.  If  dx  =  0,  then  Nfj(zj)  =  1.  If  di  >  0, 
let  B(x.\6)  C  Id,  he  a  closed  ball  in  which  ^X(zj,x)  is  nonzero  and  continuous  as 
a  function  of  x.  If  n  sample  locations  lie  in  the  open  ball  B{x.\8),  then  Da<x^{  0)  is 
bounded  from  below  by 


2  ■  C,  ■  min  |'Ifl(z,,x)|  •  min  M,(x) 

x  eB(z.,6)  z^B(x,,s) 


6m'+d'  ,  ifn  =  0; 

ijn>  0. 
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where 


^>o 


0  <  rk  <  6  for  1  <  k  <  n  . 

C,  is  i/ie  constant  from  Lemma  2.5  for  a  d{  dimensional  domain. 

Proof:  If  di  =  0,  then  Id,  is  a  single  point,  and  Ne,t(zj)  is  either  zero  or  one.  Assume 
that  d  >  0. 

If  n  =  0,  then  the  Theorem  follows  directly  from  Lemma  2.4.  If  n  >  0,  then  let  {tijj.} 
be  the  sample  locations  used  in  the  construction  of  T.  Let  n'  be  the  number  of  sample 
locations  used.  By  Lemmas  2.3,  2.5,  2.7  and  2.8,  and  equation  2.2, 


(0)  > 

2 

max 

'r€G0i,iJ(6) 

> 

2 

■1/  V 

> 

2 

min  | 

x  €fl(r.  -.6) 

> 

2 

min  | 

x  €B{x,\6) 

>  2  Q-  min  |*,(i„x)|.  mill 
x€B(xk\6)  x€B(z0;6) 


By  Lemma  2.9, 


Therefore, 


B{x.;5)  C  (J  B{y*'k,5r{w'k))  . 


Vol  (B(x.;6))<  |J  Vol(B(^k;5r(w'k)))  , 


Sd'  <  £(5rK))*  ■ 

k=  1 

Setting  rfc  =  r(u>k)  for  1  <  k  <  n'  and  rk  =  0  for  k  >  n'  produces  the  desired  bound. 
This  bound  verifies  the  Theorem.  I 
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2.1.3  Bound  on  N(ti(zj) 

Assume  that  an  algorithm  a  uses  a  given  number  of  sampling  locations  in  Id,  ■  Then 
Da,ij(0)  can  be  bounded  from  below  by  minimizing  the  error  bound  in  Theorem  2.1. 
There  will  exist  a  minimum  number  of  sample  locations  that  guarantees  that  DailJ(0) 
is  less  than  a  given  error  tolerance.  By  Lemma  2.2,  this  number  is  a  lower  bound  on 

A U*i). 


Lemma  2.10  If  m,  >  1  and  dt>  1,  then  the  problem 


subject  to 


is  solved  when 


min  Y l  rk 

k=  1 


Y,rk  > 

k=l 


dt+m, 


d. 


rk  =  -  •  Vfc  G  . 

The  minimum  value  of  the  objective  function  is 


Proof:  Add  the  constraint  that  rf  =  q  for  some  given  q  >  {6/ 5)d' .  By  LaGrange’s 
method  [Apo74]  and  inspection  of  the  objective  function,  we  know  that  the  minimum  of 
the  augmented  problem  must  satisfy 


_d_ 

drk 


d,  +m, 
rl 


W= 1 


+  A'E 


-i 


/=1 


=  0 


Vk 


*  ■  rf'  =  ?  • 
1=1 

The  unique  solution  to  this  condition  is 


rk  =  ( q/n ) 


\/k  . 


(2.3) 
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The  corresponding  value  of  the  objective  function  is 

q-{q/n)^  .  (2.4) 

Expression  2.4  is  minimized  by  choosing  q  =  (<5/5)d‘. 

For  this  value  of  q,  expression  2.4  is  a  lower  bound  on  the  objective  function  for 
any  feasible  set  of  {»■*}.  Since  this  lower  bound  is  achieved  for  the  rk  values  described  in 
equation  2.3,  this  must  be  the  minimum  value  for  the  objective  in  the  original  minimization 
problem.  I 

The  following  lemma  is  a  direct  consequence  of  Lemma  2.10  and  Theorem  2.1,  where 
B(xm\6)  satisfies  the  usual  assumptions. 

Lemma  2.11  If  N(^{z3)  ^  0,  d,  >  0,  and  n  >  0,  then 

/i  \d,+m,  /  A  \m> 

2-C{-  min  |^j(z;,x)|*  min  Mi(x)-i-)  ■  ( — r-  ]  •  6d' 

i€fl(x.,6)'  n  v  xeB(i.,6 )  \5 )  Vnt/ 

is  a  lower  bound  on  Da>itj( 0)  for  any  algorithm  a  €  A  with  n  sampling  locations  in 

B(x.;6). 

Note  that  this  lower  bound  comes  from  setting  rjt  =  |  •  6-n~l/d'  for  k  E  {1, . . . ,  n}.  This 
is  not  achieveable  using  only  n  sample  locations,  but  it  is  approximated  by  equidistributing 
the  sample  locations.  In  this  case, 

8  1 

rk  «  —  ss  - 

nd,  h 

where  h  is  the  distance  between  sample  locations. 

Lemma  2.11  is  sufficient  to  bound  Ntii(zj). 

Theorem  2.2  Let  B(x*\8)  C  Id,  be  any  closed  ball  in  which  'I 't(zj,x)  is  nonzero  and 
continuous  as  a  function  of  x.  If 


then 


C{-  min  x)|  •  min  M,(x)  •  8m,+d'  >  e 

ze8(z.,6)  3'  n  z e8(z.,s) 


C\  ■  min  1^,(2,,  x) )  •  min  M,(x)  ■  8m'+d' 
.  x£B(x.,6)'  1 


N(,t{zj)  > 

where  C\  is  a  positive  constant  dependent  only  on  m,,  dt,  and  ||  •  ||(,). 
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Proof:  By  Lemmas  2.4  and  2.2,  if 


Ci  •  min  |'I';(z,,x)|  •  min  M,(x)  ■  Sm,+d'  >  t  , 

z€B(x.,6)  x£B(x,,6) 


then  at  least  one  sampling  location  is  required  in  B(x\8)  in  order  to  satisfy  the  error 
tolerance.  Let  C-  =  C,  •  (l/5)m,+d’.  By  Lemma  2.11,  a  necessary  condition  for  an 
algorithm  a  to  satisfy  Do,jj(0)  <  2  •  e  is  that 


C[  •  min  |'&j(z;,x)|  •  min 

r6fl(x.,6)  ±eB(x.,S) 


M,(x)-  -r  < 


or 


C\  ■  min  I'l'jlz,, x)|  •  min  A/,(x)  •  8m,+i‘ 

xeB(x.,s)  xeB(x.,s) 


0* 


(2.5) 


where  n  is  the  number  of  sample  locations  in  B(x.;6).  Since  the  total  number  of  sample 
locations  is  greater  than  or  equal  to  n,  a  £  A°ti(zj)  unless  inequality  2.5  holds.  Therefore, 


c; 


min  |^,(2j,x)| 

x€B(x.,S) 


min  Ml{x)-8m'+d' 

X^B(Xm,S)  , 


The  proof  of  the  Theorem  follows  immediately  from  Lemma  2.2.  I 

Thus,  for  most  problems  satisfying  our  assumptions,  N(ii(z)  grows  at  least  linearly  as  a 

function  of  e~d'^m'  when  e  — ♦  0.  Note  that  the  condition 


Ci  •  min  •  min  Af,(x)  •  8m,+d'  >  t 

xeB(x,,S)  x€B(x.,6) 


always  holds  if 


C[  •  min  |'P,(zJ,x)|-  min  M,(x)  •  8m,+d' 
£  x6  B(x.,S) 


h-  A.  d- 

771  ■  /l\m.  / 1  \  m,(m,  +  dt) 


> 


(j): 


This  follows  from  the  fact  that  C-  =  C,  •5_(m,+d’).  So  the  condition  need  only  be  checked 
if 


\C[  ■  min  !'P1(zJ,x)|  •  min  A/,(x)  •  6m,+d'  j 

y  x€B{x,.6)  £^B{x.,6)  J 


d'  d. 

1 


<  1  . 


in 


which  case  the  lower  bound  on  ;Y,.,(z)  in  Theorem  2.2  is  either  0  or  1. 
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2.1.4  Generalizations 

The  lower  bound  in  Theorem  2.2  can  be  very  weak.  In  this  section  we  discuss  how  to 
calculate  a  tighter  bound. 

The  construction  of  T  in  the  previous  sections  holds  in  any  B(xm;8)  satisfying  the 
usual  conditions.  In  particular,  if  the  error  function  is  constructed  in  each  member  of  a 
set  of  nonoverlapping  closed  balls,  then  the  lower  bound  on  the  error  is  the  sum  of  the 
errors  from  each  ball. 

For  fixed  i,  let  C(x;<5)  be  the  open  d,  dimensional  cube  centered  on  x  with  volume 
8di .  Let  C(x.\8)  be  its  closure.  Let  8U  =  2~u  for  some  positive  integer  v.  If  i  >  0,  divide 
Id,  into  8~di  equal  sized  cubes.  Let  X  =  {x/}  be  centers  of  the  cubes.  Let  na>/  be  the 
number  of  sampling  locations  in  C{xc,8u)  associated  with  an  algorithm  a.  Construct  a 
new  error  function  T'  in  the  following  way. 

a)  Let  X'  C  X  be  centers  of  closed  cubes  over  which  tyj(zj,x)  is  nonzero  and  contin¬ 
uous. 

b)  If  »fj(zj,Xf)  is  positive  on  C{xi\8u),  then  use  the  previous  construction  to  define 
a  function  T  in  the  open  ball  B{xr,8v/2).  Call  this  function  T,.  If  tf,(zj,xj)  <  0, 
construct  the  same  function  T,  but  Jet  T/  =  — T.  Note  that  is  a  member  of 
Ga,,tJ(0)  in  either  case. 

c)  Define  T'  to  be 

r'  =  E  r,  . 

rt€X' 

By  the  same  reasoning  used  in  Lemma  2.8,  T'  6  Go,,,j(0).  Thus,  a  new  lower  bound  on 

Da,iAb)/2  is 


C'l  ■  8 


d,+m. 


min  |'f,(2,-,x)| 
x€fl(x,,W 2)'  V  ;l 


min  MAx) 
x€fl(i|,W2) 


min{n  ,1} 


(2.6) 


where  C"  =  C'i/2m,+d'.  Denote  expression  2.6  by  err To  bound  N(,,{z3)  for  this 
construction,  we  need  to  solve  the  optimization  problem 


min 


IT  n«,i 

x,ex> 
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subject  to 


err a,i,i/,L{zj)  —  c  • 


(2.7) 


The  optimal  sampling  locations  for  the  original  construction  were  essentially  equidis- 
tributed.  This  construction  recognizes  that  data  from  different  regions  of  the  domain  may 
have  differing  degrees  of  influence  on  the  solution,  and  thus  require  different  amounts  of 
data  sampling.  This  lower  bound  is  a  function  of  the  number  of  subcubes  the  original 
cube  is  divided  into,  2I/.  But  the  solution  to  problem  2.7  is  a  lower  bound  on  Nc{(z:) 
for  any  positive  integer  u.  Thus,  the  lower  bound  can  be  tightened  by  minimizing  the 
solution  for  fixed  v  over  the  the  set  of  all  possible  v. 


2.1.5  Summary  of  Lower  Bound  Results 

The  assumptions  on  the  data  functions  described  in  Sections  1.3.3  and  1.3.4  are  appro¬ 
priate  for  an  a  priori  error  analysis  of  a  method  whose  error  is,  at  least  partially,  a  function 
of  the  values  of  the  m,th  order  partial  derivatives  of  the  data  functions.  If  ^(z^x)  and 
Mi{x)  are  both  nonzero  over  an  open  set  in  the  domain,  then  the  lower  bounds  derived 

dt 

here  imply  that  the  data  sampling  will  increase  at  least  linearly  as  a  function  of  e  when 
the  error  tolerance  decreases.  Note  that  this  bound  is  on  the  number  of  data  samples 
required  when  approximating  a  single  solution  value,  and  not  just  when  approximating  the 
entire  solution  function.  The  construction  of  the  error  function  described  here  also  allows 
us  to  localize  the  contribution  to  the  error  when  sampling  from  any  particular  subregion. 


2.2  Upper  Bound 

Upper  bounds  on  Ntii(zj)  can  be  calculated  from  Na^{zj)  for  any  particular  algorithm 
a  €  Ae({zj}).  In  particular,  for  each  t,  let  g{  be  some  approximation  to  the  data  function 
gi  calculated  from  values  of  g,  taken  at  a  finite  number  sampling  locations  in  Id>.  Define 
the  approximation  u(z;)  to  be  the  sum  of  the  components  {uj(z;)},  where  each  u,(zj)  is 
defined  by 

«.■(*»)=  /  Vi(zj,x)gi{x)dx  . 

Jt*> 
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Then  this  construction  corresponds  to  an  algorithm  a  £  At({zj})  if 

M*j)  -«i(2j)l  ^  «/* 

for  all  t.  We  will  call  such  an  algorithm  a  Green’s  function  method  since  it  corresponds 
to  using  the  Green’s  function,  and  associated  kernels,  directly.  Since 

M*i)  -  «.(*;)!  =  f  x)  ■  (9i(x)  -  <7,(*))  dx 

\JI“. 

<  I  \^i{zj,x)\-\gi{x)-gi{x)\dx  ,  (2.8) 

Jt*> 

the  solution  error  can  be  bounded  by  a  function  of  the  data  approximation  error. 

Most  Green's  function  methods  based  on  piecewise  polynomial  approximations  to  the 
data  will  satisfy  the  error  criterion  if  enough  equidistributed  data  sampling  locations  are 
used  when  calculating  each  component.  To  verify  this,  we  describe  a  family  of  Green  s 
function  methods  based  on  a  simple  piecewise  polynomial  approximation,  and  calculate 
an  upper  bound  on  the  error  for  each  member  of  this  family.  We  then  use  this  error  bound 
to  calculate  an  upper  bound  on  Ntj(zj).  Most  commonly  used  algorithms,  both  Green’s 
function  methods  and  others,  will  have  similar  results. 

2.2.1  Simple  Bounds 

Consider  a  d,  dimensional  cube  C(x.;<5).  Let  h  =  <5/(m,  +  1).  Let 

6 

X(s,k)  ~  (^»)a  ^  ‘f’ 

for  1  <  s  <  di,  where  ( xm)a  is  the  sth  component  of  x..  Let  ir,  be  the  following  set  of 
equally  spaced  locations  in  [(x,),  -  <5/2,  (x.)4  +  <5/2], 

{^(a.l))  •  •  •  i  •*•(», ro,)}  • 

Define  rr(d,)  to  be  the  following  set  of  rnf  locations  in  C(x„;<5). 

ir{di)  =  7Ti  X  •  •  ■  X  7TJ, 

=  {x|(x),  €  JT,}  - 


(2.9) 
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Lemma  2.12  Let  g  be  a  function  defined  in  a  di  dimensional  cube  C(x.;8).  Define 
7r(dt)  to  be  the  equidistributed  locations  in  C(x*;8)  described  by  equation  2.9.  Then 
there  exists  a  unique  polynomial  g(x)  of  degree  at  most  rrii  —  1  in  each  variable  that 
interpolates  g  at  the  locations  in  w(di).  If  g  has  m{th  order  continous  partial  deriva¬ 
tives  in  C(x.\8),  then  the  error  in  the  approximation  is  bounded  from  above  by  a  term 
of  the  form 


C{  ■  6T 


max  max 

xeC(x.;6 )  £€A'(m,) 


d>  ,-i 

Wg{x) 


where  Ci  is  independent  of  8  and  xm. 


This  is  a  variation  of  the  results  in  Chapter  5  of  Prenter  [Pre75].  See  Appendix  C  for  a 
discussion  of  the  proof. 

Let  gi  be  some  element  of  G,.  Let  C{x.\8)  be  some  dt  dimensional  cube  in  Idx.  By 
Lemma  2.12,  there  exists  a  unique  polynomial  that  interpolates  the  function  p,(x)  at  the 
locations  n (di)  described  above.  Call  this  function  pi(x).  By  the  equivalence  of  finite 
dimensional  vector  norms  [Atk 78,  page  414J,  there  exists  a  constant  C[  such  that 


Ci  •  max 

n€K(mi) 


d“ 


for  all  x  £  Id,,  where  C,  is  defined  in  Lemma  2.12.  Therefore,  the  error  bound  on  the 
approximation  described  in  Lemma  2.12  becomes 


|^,(x)  —  p,(x)|  <  C\  •  8m'  ■  max  M,(x)  (2.10) 

xgC(r.;i) 

when  x  €  C(xm;  8). 

Define  an  algorithm  a  in  the  following  way.  For  each  i  >  0,  approximate  Ui(zj)  in  the 
following  way. 

a)  Divide  Id,  into  cubes  of  the  form  C(i(;  2'"°  • ),  for  some  positive  integer  i/aj. 


b)  In  each  subcube  C(x/;  2~"<M ),  let  pa,i,/(x)  be  the  approximation  to  gfix)  defined  in 
Lemma  2.12.  Define  gattj  to  be  pa,,,/  as  a  function  of  x  e  Id, ,  pa,tii(x)  =  pa,,t/(x). 


c)  Approximate  ufizj)  by  the  following  expression, 

u*,x{zj)  =  <lti{zj,x)gij{x)dx 

,  Jed,. 2-“’) 
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A  simple  upper  bound  on  the  error  in  this  approximation  to  ufizj)  follows  directly  from 
equations  2.8  and  2.10. 

Lemma  2.13 

I  u, ■(*,•)  -  iia.i(-Zj)  |  <  Cl  ■  2-*'0''m*  •  max  Affix)  •  f  \Wi(zjt  x)\dx  . 

x6k  J/d, 


If  i  >  0,  then  the  approximation  requires  2l'a•''d,  subcubes,  and  a  total  of  mf  ■  2Vayd' 
sample  locations.  The  upper  bound  on  Nffizj)  generated  by  algorithms  of  this  type  is 
the  following. 


Theorem  2.3  For  all  i, 

d  A. 

for  a  finite  constant  C"(2j )  independent  of  Mi  and  e. 

Proof:  Consider  an  algorithm  a  satisfying  the  conditions  on  page  50,  where  cubes  of 
the  form  C(xr,  2-l/<“)  are  used  to  subdivide  when  approximating  the  ith  component. 
Therefore, 

NaAzj)  =  mf  •  .  (2.11) 

From  Lemma  2.13,  a  sufficient  condition  to  satisfy  the  error  criterion  is 


C1 . 2~Ua  '  m'  .  max 
*e/d, 


Mt(x)  J  \^fizrx)\dx  <  j 


for  all  i.  Therefore, 


•  C[  ■  max  Affix)  f  |'I',(z,,  x)|  dx 
_ Jit, _ 


A 


V 


<  2^  >  d< 


for  all  i  is  also  a  sufficient  condition.  Define  a  so  that 


—  •  log. 
m. 


( l  -  C\-  max  Mt(x)  f  |'£,(r;,  x)|  dx 

£€/d,  Jta 


+  1 


(2.12) 


(2.13) 
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for  all  i.  Then  the  bounds  in  inequality  2.12  are  satisfied,  and  a  €  At({zj}).  Since 
Na,i{Z])  >  N('i(zj)  when  a  6  Ac({zj}),  the  Theorem  holds  for 

dt 

Cf(zy)  =  2*-mf  •  (‘C'-f  l*.fex)Uxj"’  . 

This  follows  directly  from  equations  2.11  and  2.13.  Since  ^(z^x)  €  this  con¬ 

stant  is  finite.  I 

As  defined  in  Section  1.3.7,  f(x)  =  0(<7(x))  if  there  exist  positive  constants  c\,  C2, 
and  x0  such  that 

ci|flf(x)|  <  |/(x)|  <  c2\g(x)\ 

for  all  x  >  xo  [HS78] .  Using  this  notation,  Theorems  2.2  and  2.3  imply  the  following 
corollary. 

Corollary  2.4  If  there  exists  a  closed  ball  B(x.‘,6)  C  Id,  on  which  Mt(x)  is  nonzero, 
and  on  which  i(zj,x )  is  nonzero  and  continuous,  then 

N,Aii)  =  ®  (<"') 

as  a  function  of  1/e. 

Proof:  If  di  =  0,  then  Theorem  2.2  states  that  Ntj(zj)  =  1  for  these  assumptions,  and 
the  statement  holds.  If  d,  >  0,  then  Theorem  2.2  bounds  N^t(z})  from  below  by 

dt 

>  ci(*j)  •  (“)  » 


where 

Cl(2j)  = 

Theorem  2.3  bounds  Nl?i(zj )  from  above  by 


C:  •  min  |'P,(2,,x)|  ■  min  M,(x)  •  6 

xGB(x.,6) 


m,+d, 


N, 


dt 

Azj)  ^  C*(Z])  ■  (“)  '  +  1  » 


dt 

c2(zj)  =  C"(zj)  ■  max  M,m'  (x)  . 

*€Ut 


where 
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Let  c'2(zj)  =  c2(zj)  +  1.  Then 

dt 

<  N,A*i)  <  4ft)  ( i) 

when  1/e  >  1.  This  proves  the  corollary.  I 


2.2.2  Improved  Bounds 

The  bounds  in  the  previous  lemma  and  theorem  can  be  improved  by  allowing  the  sampling 
frequency  to  vary  over  the  domain.  For  example,  define  a  Green’s  function  method  a 
in  the  following  way.  Divide  the  domain  I d,  into  cubes  of  the  form  C(x(;  2~v)  as  in 
Section  2.1.4.  Let  X  be  the  set  of  centers  of  these  cubes.  Divide  each  of  these  cubes 
into  cubes  of  the  form  C(xp;  2~‘,at)  where  va,i  >  v.  Use  the  polynomial  approximation 
from  Section  2.2.1  in  each  of  these  smaller  cubes  to  define  an  approximation  to  <7,  in  Id,. 
Call  this  approximation  gt.  By  Lemma  2.12  and  inequality  2.10,  an  upper  bound  on  the 
error  in  using  <1 \(z^x)gi{x)dx  to  approximate  u,(zj)  is 


C' 


-VI  mi 


max 


*€<?(*,  ;2-‘')n/<li. 


Mi(x)  ■  J  \^i{zj,x)\dx 

n/d, 


(2.14) 


Denote  this  expression  by  erra 

For  this  construction,  the  number  of  sample  locations  in  C(i;;  2_1/)  is 

mf  •  . 


The  total  number  of  sample  locations  is 

mf  ■  2^-^'  . 

x,  ex 

To  minimize  the  total  number  of  sampling  locations  in  7d|  while  still  preserving  the  error 
bound,  solve  the  problem 

min  mf  ■  £  2(u,-‘/)d' 

ii€X 
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subject  to 

err0(t>,t/(2j)  <  j  . 

The  minimization  is  taken  over  all  legal  combinations  of  the  {vt}  values.  The  solution  to 
this  problem  indicates  how  many  subcubes  to  place  in  each  of  the  larger  cubes.  This  is 
equivalent  to  specifying  the  number  of  sample  locations  to  place  in  the  larger  cubes. 

Using  equation  2.14,  we  can  close  a  gap  left  in  the  lower  bound  analysis.  We  have 
no  bounds  on  Ncj(zj)  for  the  case  when  M,(x)  •  S' ,(zy,x )  =  0  for  all  x  G  Id,,  but 
neither  function  is  identically  zero.  If  ui  =  v  for  all  /  in  equation  2.14,  then  an  immediate 
consequence  of  this  construction  is  the  following  theorem. 

Theorem  2.5  For  any  i,  assume  that  either  Mi{x)  —  0  or  ^(zy,  x)  =  0  in  each 
subregion  C(xr,2~1')  fl  Id,,  xi  G  X.  Then 

Nt,i(zj)<m?-2l/d'  . 


Proof:  Pick  some  e  >  0.  Let  i'  be  some  element  of  {0, ... ,  /}.  Define  a  to  bean  algorithm 
that  approximates  u(z_,)  in  the  following  way.  Approximate  u;<(Zj)  in  the  fashion  described 
above,  using  =  v.  Approximate  the  other  components  in  the  fashion  described  on 
page  50,  where  i/a,,  satisfies  equation  2.13  for  all  i  ^  i1.  Thus,  | u,(z;)  —  u0tl(zj)|  < 
e/l  when  i  ^  i'.  Assume  that,  for  the  given  subdivision  of  Id, ,  either  Mt(x)  =  0  or 
^.(zj,!)  =  0  in  each  subregion  C(x/;2-i/)  fl  Id,,  xi  €  X.  Then 

Wi'(Zj)  ~  «a,.'(^)l  <  *rra,i',v,u(Zj)  =  0  . 

Therefore  a  G  At({zj})  and 

Na,i'{zj)  =  mf  ■  2u'd’  . 

Since  Nail‘{zj)  is  an  upper  bound  on  Nty{zj),  and  since  both  t  and  i'  were  arbitrary,  this 
proves  the  Theorem.  I 

Therefore,  N(<i{z})  is  a  bounded  function  of  e  if  the  assumption  in  Theorem  2.5  holds  for 
any  finite  u. 
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2.2.3  Summary  of  Upper  Bound  Results 

Upper  bounds  on  the  amount  of  data  needed  when  using  simple  piecewise  polynomial 
approximations  to  a  data  function  have  the  same  asymptotic  behavior  as  the  lower  bound 
calculated  in  Section  2.1.  Therefore,  only  constants  can  be  improved  by  using  global  ap¬ 
proximation  methods  to  approximate  these  data  functions.  For  polynomial  approximations 
with  local  bases  (SF73],  like  the  example  provided  here,  the  contribution  to  the  error  from 
a  subregion  is  a  function  of  the  number  of  sample  locations  in  that  subregion.  This  allows 
us  to  decrease  the  upper  bound  by  adjusting  the  frequency  of  sampling  in  subregions. 


2.3  Optimal  Parallel  Algorithms 

In  this  section  we  calculate  lower  bounds  on  the  complexity  of  parallel  implementations 
of  algorithms  in  At{Z).  Assume  that  we  have  a  fully  connected  multiprocessor  with  an 
unlimited  number  of  homogeneous  processors  and  memories.  Furthermore,  assume  that 
it  has  infinitely  powerful  communication  capabilites.  That  is,  the  memory  access  and 
transmission  times  are  both  zero.  Thus,  the  communication  cost  is  always  zero  on  this 
architecture.  Consider  the  class  of  all  parallel  implementations  of  algorithms  in  Ae(Z) 
on  this  idealized  multiprocessor.  Then  the  minimum  parallel  complexity  achieved  by  this 
class  represents  a  lower  bound  on  the  parallel  cost  for  all  multiprocessors  in  the  following 
sense.  Assume  that  the  processors  in  a  given  multiprocessor  are  no  faster  than  those  in 
the  idealized  multirpocessor.  Then  any  parallel  algorithm  on  the  given  multiprocessor  can 
be  simulated  on  the  idealized  multiprocessor  with  no  increase  in  the  parallel  complexity. 
If  some  processors  are  faster,  then  the  bound  is  recovered  by  scaling  the  complexity  to 
reflect  this  difference. 

We  will  refer  to  the  set  of  parallel  algorithms  that  achieve  this  minimum  as  optimal 
parallel  algorithms.  Note  that  the  serial  algorithms  corresponding  to  optimal  parallel 
algorithms  may  not  have  good  parallel  implementations  on  a  particular  multiprocessor. 
The  finite  number  of  processors  and  the  cost  of  communication  can  change  the  behavior  of 
the  parallel  cost.  In  this  section  we  describe  what  form  this  lower  bound  takes,  and  describe 
a  class  of  simple  linear  algorithms  whose  parallel  complexity  has  the  same  asymptotic 
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behavior  as  t  — ►  0. 


2.3.1  Lower  Bound  on  Parallel  Complexity 

Lemma  2.1  provides  a  lower  bound  on  the  parallel  complexity  of  an  optimal  parallel 
algorithm.  The  following  theorem  then  follows  from  Theorem  2.2. 

Theorem  2.6  Assume  that  there  exists  some  z  €  Z  and  some  i  6  {1,...,/}  such 
that  di  ^  0  and  the  following  condition  holds.  There  exists  a  closed  ball  B(x.;  6)  C  Id, 
on  which  ^(2,x)  is  nonzero  and  continuous,  and  on  which  A/,(x)  is  nonzero.  Then 
the  parallel  complexity  of  an  optimal  parallel  algorithm  for  AC(Z)  is  bounded  from 
below  by  an  expression  of  the  form 

/(+)  ‘  (ci  +  c2  •  IoS2  (')  ) 

as  e  —*  0.  The  constants  C\  and  c2  are  both  independent  of  e,  and  c2  >  0. 


Proof:  Let 


e.  =  Ci  ■  min  I'f.-fz,-,  x)|  ■  min  M,(x)  -6 
xefl(r.,5)  3  f  €*(*..«) 


m,+d. 


By  Theorem  2.2.  if  e  <  e.,  then 

Na,i(z)  > 

for  all  a  €  A<(Z).  Therefore,  if  t  <  e..  then 


C:  ■  min  |'f,-(2,x)|*  min  M,(x)  •  8 

xeB(z.,S)  x€B(r.,6) 


m,+d, 


max  [log2  (iVa,,(2))]  >  ci  +  -f--l°g2(-)  , 

z£Z  771,  \c/ 


where 


ci  =  max  ~  •  log2  (cl  ■  min  |'P,(2,x)|-  min  M,(x)  •  6m,+d'  ]  . 

*€Z  rrii  \  xeB(x.,«)  xeB(x.,5)  ) 

By  the  assumptions  of  the  Theorem,  ci  is  finite.  The  proof  then  follows  from  Lemma  2.1. 


Corollary  2.7  Given  the  assumptions  of  Theorem  2.6,  the  parallel  complexity  of 
parallel  implementations  of  algorithms  in  A((Z)  grow  at  least  linearly  as  a  function 
of  log2(c-1)  when  e  decreases. 
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2.3.2  Upper  Bound  on  Parallel  Complexity 

In  Theorem  2.3  we  described  an  algorithm  a  €  Ae({z})  that  approximated  each  compo¬ 
nent  of  u(z )  by  an  expression  of  the  form 


Ua,i(z)  =  /  Vi(z,x  )ga,i(x)dx 

J  id 


'U 
2^0,1  <*1 


-  ?/c 


JC{ir,2~‘'a,d') 


^ i(z,x)gatij(x)dx 


(2.15) 


when  i  >  0.  Each  function  gaii,i(x)  is  the  polynomial  interpolant  of  g(x)  in  the  d,  dimen¬ 
sional  cube  C(xi;  2_l,a’ d')  described  in  Section  2.2.1. 

Let  TCi(di)  =  {xi,k}  be  the  set  of  sampling  locations  in  C(x(;  2~l/a d' ).  Let  //,jt(x)  be 
the  unique  polynomial  of  degree  at  most  m;  —  1  in  each  variable  such  that 


UAx) 


J  1 ,  if  x  =  xltk  ; 

1  0  ,  if  x  €  Tri(di) ,  x  ±  xl{k  . 


Then  ga,,,i(x)  can  be  represented 


by 

J2gi(xi,k)-  h,k{x)  . 

k=l 


See  AppendixC  and  [Pre75]  for  a  more  detailed  discussion  of  this  representation  of  the 
interpolating  polynomial. 

The  expression  for  ua,,{x)  in  equation  2.15  becomes 


2*'a,>  /  m |* 

Ua,i(z)  =  Y2  f  *i(z,x)  J^9i(xl,k)  ■  ll,k(x) 
1=  1 _ d  .  W=1 


C(x,; 


dx 


! 


2va,iaf  mt 


=  II  X ]  9i{xi,k)-  f  ' Vi{z,x)lt%k{x)dx  . 

i=i  *=i  L  d  . 

Each  factor  i(z,x)liAx)  dx  is  independent  of  the  data,  and  can  be  pre¬ 

computed.  Call  it  ratiik(z).  The  expression  for  ua.,(z)  is  then 


Ua,t{z)  =  H  ra.,,l.k{z)  ■  9 i(ll.k) 

1=  1  k=  1 


(2.16) 
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The  multiplications  in  expression  2.16  are  ah  independent,  and  the  additions  can  be 
calculated  in  parallel  as  in  the  example  on  page  10.  Therefore,  the  parallel  complexity  of 
evaluating  this  expression  can  be  as  low  as 

ft-)  +  /«■)■  [log,  (mf  •  2-  '■*)]  (2.17) 

when  mf  •  2vd'  processors  are  used.  Let 

dt 

C,(z )  =  C"(z)  •  max  M,m'  ( x )  • 

=  2’'’  •  mf  •  max  M?'  (x)  (hC[-  / 

*€/„.  V  Ju, 

where  C"(z})  is  the  constant  defined  in  Theorem  2.3  and  C\  is  the  constant  from  equa¬ 
tion  2.10  and  Lemma  2.13.  By  the  definition  of  a  in  Theorem  2.3,  we  know  that 

log,  (mf  ■  2--  A)  <  log,  fd(,)  ■  (i)  "■  +  1 

<  max  |  log,C,(z)  +  ■  log,  +  1 ,  0  |  .  (2.19) 

Since  '!/,  is  a  type  1  kernel,  C,(z)  is  a  finite  lonnegative  constant  independent  of  e. 

The  approximation  of  u(z)  requires  the  evaluation  of  an  expression  like  expression  2.16 
for  each  component.  The  final  step  of  the  approximation  is 

i 

“o(f)  =  • 

i  =  l 

The  parallel  complexity  of  evaluating  this  sum  can  be  as  low  as 

/<+)■  [log,/]  (2.20) 

if  [//2J  processors  are  used.  The  following  lemma  describes  how  small  the  parallel  com¬ 
plexity  of  algorithm  a  can  be  on  the  idealized  multiprocessor. 

Lemma  2.14  Approximate.  u(z)  to  within  an  error  tolerance  of  e  in  the  following 
way.  Use  the  algorithm  on  page  50  of  Section  2.2.1  to  approximate  each  component 
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u,(z)  to  within  an  error  tolerance  of  e/l  by  satisfying  equation  2.13.  There  exists  a 
parallel  implementation  of  this  algorithm  whose  parallel  complexity  is  bounded  from 
above  by  an  expression  of  the  form 

/(*)  +  /(+)  ‘  (c3(^)  +  c4  •  log2  (i)  ) 

when  t  <  1.  The  constants  03(2)  and  c4  are  independent  of  e,  and  c4  is  nonnegative. 
No  more  than 

(i+i)  + 

1=1 

processors  are  required  to  achieve  this  bound.  Each  C,(z)  is  a  finite  nonnegative 
constant  independent  of  e. 

Proof:  Since  the  approximation  of  each  component  is  independent  of  the  others,  the 
parallel  complexity  of  approximating  the  components  can  be  as  low  as 

/(-)  +  ^max^  ^  max  jlog2C,(z)  +  ^ •  Iog2  Q  +  1  ,  0  j  j 

-  d« 

if  [ Ci(z )  •  (c)  m-  +  lj  processors  are  used  to  calculate  each  component.  Therefore, 

processors  are  enough  for  this  step  of  the  parallel  implementation.  This  follows  from 
equations  2.17  and  2.19.  Using  expression  2.20,  the  final  summation  of  the  components 
need  only  have  a  parallel  complexity  of  /{+)  •  [log2/j  when  [// 2J  processors  are  used. 
Enough  processors  are  available  from  the  previous  step  to  satisfy  this  condition.  Thus, 
the  Lemma  holds  if 


c3(z)  =  [log2/|  +  max.  (max  {log2  C,(z)  +  1  ,  oj)  +  1  , 
*€{l,  -,0  '  ^  ’ 


and 

c4  =  max 
*€{1 . 0 

I 

The  following  theorem  is  an  immediate  consequence  of  this  lemma. 
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Theorem  2.8  Define  an  algorithm  a  £  A((Z)  in  the  following  way.  For  each  z  £  Z , 
approximate  u(z)  to  within  an  error  tolerance  of  e  using  the  algorithm  in  Lemma  2. If. 
There  exists  a  parallel  implementation  of  a  whose  parallel  complexity  is  bounded  from 
above  by  an  expression  of  the  form 

/(*)  +  /(+)  •  (c3  +  c4  •  log 2  j 

The  constants  c3  and  c4  are  both  independent  of  e,  and  c4  >  0.  Let  \Z\  be  the  number 
of  locations  in  Z .  There  exists  an  expression 

that  is  an  upper  bound  on  the  number  of  processors  required  to  achieve  this  bound  on 
the  complexity,  where  each  C,  is  a  finite  nonnegative  constant  independent  of  t. 

Proof:  The  approximation  of  each  solution  value  u(z)  for  z  £  Z  is  independent  of  the 
others.  Thus,  we  can  calculate  each  solution  value  on  a  different  subset  of  the  processors 
in  the  idealized  multiprocessor.  The  bound  follows  from  Lemma  2.14  when 

c3  =  max  c3(z)  , 

*6  Z 

C,  =  max  C{(z)  for  i  £  {1, ....  /}  , 

z£Z 

and  c4  is  the  same  as  in  Lemma  2.14.  By  assumption  2  on  page  18, 

max  /  x)|  dx  <  max  /  x)|  dx  <  oo  . 

zez  Jid'  Ji 

That  Ci  is  nonnegative  and  finite  then  follows  from  equation  2.18  and  the  assumptions 

on  Mfix).  I 

Corollary  2.9  The  parallel  complexity  of  an  optimal  parallel  algorithm  for  A((Z)  is 
bounded  from  above  by  an  expression  of  the  form 

/(.)  +  /(  +  )  '  (c3  +  •  log2  j  ■ 

The  constants  c3  and  c4  are  both  independent  of  e,  and  c4  >  0. 
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Proof:  The  algorithm  described  in  Theorem  2.8  is  a  member  of  Ae(Z).  Therefore,  the 
upper  bound  described  in  Theorem  2.8  is  also  an  upper  bound  on  the  parallel  complexity 
of  the  optimal  parallel  algorithms.  I 

Note  that  using  the  algorithm  in  Section  2.2.1  for  each  component  of  each  solution 
value  in  { u(z )  \z  6  Z]  describes  a  finite  complexity1  explicit  linear  algorithm  of  the  type 
defined  in  Section  1.3.7, 

7 

u  =  1£R.9,  ■  (2.21) 

t=i 

Assume  that  the  jth  element  of  the  vector  u  approximates  u(z:)  for  z:  €  Z.  Then,  for 
fixed  i,  the  elements  of  the  set  {r,  ,/*(£_,)}  are  the  nonzero  elements  in  the  jth  row  of  R<. 
The  following  corollary  is  an  immediate  consequence  of  this  discussion  and  Theorem  2.9. 

Corollary  2.10  There  exists  a  finite  constant  C  independent  of  t  for  which  the  fol¬ 
lowing  property  holds:  For  any  e  >  0,  there  exists  a  finite  complexity  explicit  lin¬ 
ear  algorithm  a  6  A<(Z)  with  a  parallel  implementation  whose  parallel  complexity  is 
bounded  from  above  by  an  expression  of  the  form 

/H  +  C  •/(+)■  log2  ^  . 

Only  a  finite  number  of  processors  are  required  to  achieve  this  bound. 


2.3.3  Summary  of  Complexity  Bounds 

When  the  conditions  assumed  in  Theorem  2.6  are  satisfied,  the  parallel  complexity  of  all 
parallel  algorithms  must  increase  as  the  error  tolerance  decreases.  And,  as  e  — ►  0,  there 
is  always  a  parallel  implementation  of  an  explicit  linear  algorithm  in  A((Z)  whose  parallel 
complexity  is  within  a  constant  factor  of  the  lower  bound.  The  size  of  this  constant  factor 
can  be  quite  large.  But  the  size  of  the  factor  can  be  reduced  by  using  the  generalizations 
described  in  Sections  2.1.4  and  2.2.2. 

^he  finite  complexity  of  this  algorithm  is  easily  verified  directly.  It  is  also  a  consequence  of 
Theorem  2.8.  The  serial  complexity  is  bounded  from  above  by  the  product  of  the  parallel  complexity 
and  the  number  of  processors  used.  Both  of  these  are  finite. 
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2.4  Conclusion 

In  this  chapter  we  have  described  upper  and  lower  bounds  on  iV<(, (£_,). 

•  If  ^ i(zj,x )  is  not  nonzero  over  some  open  set  contained  in  /<*,,  then  Ncj(zj)  =  0. 

•  If  N('i(zj)  ^  0  and  d,  =  0,  then  NCt,(zj)  =  1. 

•  If  there  exists  a  finite  covering  of  the  domain  by  cf,  dimensional  cubes  such  that  either 

$i(zj,x)  =  0  or  M{(x)  =  0  in  each  cube,  then  is  bounded  independent  of 

e. 

•  If  and  Af.(x)  are  both  nonzero  on  any  open  set  in  the  domain,  then 

N(,i(zj)  =  © 

as  a  function  of  1/e. 

We  have  also  described  upper  and  lower  bounds  on  the  asymptotic  behavior  of  the  parallel 
complexity  as  the  error  tolerance  decreases. 

•  If  there  exists  at  least  one  z  E  Z  and  one  i  E  {1,...,/}  such  that  ^,(z},x)  and 
Mi(x)  are  both  nonzero  on  any  open  set  in  the  domain,  then  the  parallel  complexity 
of  parallel  implementations  of  algorithms  in  AC(Z)  must  increase  at  least  linearly  as 
a  function  of 

/(  +  )  •  dx  ■  iog2  (-) 

when  e  — ♦  0. 

•  For  an  ideal  multiprocessor  with  arbitrarily  many  processors  and  infinitely  powerful 
communication  capabilities,  the  parallel  complexity  of  an  optimal  parallel  algorithm 
is  bounded  above  by  a  function  of  the  form 

/(+)•  max  (  — logjf-'))  +  c  , 

>€{i . J)  \m,  \e/  / 

where  c  is  a  finite  constant  independent  of  t. 


Chapter  3 
Scaling 


Current  research  into  parallel  processing  and  multiprocessors  is  driven  by  the  need  to 
increase  computing  power.  Two  goals  are  achieved  by  increasing  the  computing  power. 
More  problems  can  be  solved  in  a  given  interval  of  time  and  problems  whose  solutions 
have  heretofore  been  too  costly  to  calculate  can  now  be  solved. 

In  this  chapter  we  examine  an  issue  related  to  how  effective  multiprocessors  are  at 
achieving  these  goals,  scaling.  A  multiprocessor  scales  if  increasing  the  number  of  proces¬ 
sors  enables  it  to  solve  larger  problems  efficiently.  A  lack  of  parallelism  in  the  algorithms 
or  increasing  communication  cost  can  prevent  this.  We  analyze  this  issue  by  using  the 
results  of  Chapter  2  to  calculate  lower  bounds  on  the  parallel  cost  of  approximating  the 
solution  of  a  PDE  on  a  particular  multiprocessor.  We  bound  the  parallel  cost  as  the  serial 
complexity  of  calculating  this  approximation  increases.  We  also  examine  the  efFect  of 
communication  costs  on  this  bound. 


3.1  Definition  of  Scaling 

Scaling  a  multiprocessor  architecture  increases  or  decreases  the  number  of  processors 
and  memories  in  the  multiprocessor  while  keeping  certain  attributes  of  the  architecture 
fixed.1  In  particular,  we  define  scaling  for  the  example  architectures  of  Section  1.1.3  in  the 

'See  [LM87]  for  a  discussion  of  parameterized  architectures.  These  are  families  of  architectures 
whose  individual  members  are  specified  by  a  particular  choice  of  the  parameters,  like  the  number  of 
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following  way.  The  graph  of  the  scaled  multiprocessor  architecture  is  of  the  same  type  as 
before  scaling,  and  each  type  of  component  has  the  same  parameters.  For  example,  when 
scaling  a  d  dimensional  array  multiprocessor  with  Nd  processors,  the  graph  of  the  new 
multiprocessor  will  still  be  a  d  dimensional  cube.  The  graph  labels  indicating  processor 
and  communication  abilities  will  also  be  unchanged.  But  the  number  of  processors  will 
now  be  Md  for  some  M  ^  N .  Thus,  scaling  defines  a  family  of  multiprocessors  with 
similar  architectures.  We  will  refer  to  a  particular  multiprocessor  as  an  instance  of  this 
architectural  family  and  to  the  number  of  processors  in  an  instance  as  its  size.  We  will 
use  the  term  scaling  up  to  mean  increasing  the  size  of  a  multiprocessor  architecture. 

Scaling  a  problem  alters  the  problem  specifications  in  such  a  way  that  the  serial 
complexity  of  the  algorithms  used  to  solve  it  changes.  By  the  assumptions  in  Section  1.3.6, 
the  problem  of  approximating  the  PDE  is  a  function  of  the  number  and  location  of  solution 
values  to  be  approximated,  the  available  data,  and  the  bound  on  the  error  in  approximating 
the  solution  values.  Therefore,  to  scale  the  problem,  one  or  more  of  these  parameters  must 
be  changed.  Scaling  the  problem  defines  a  family  of  similar  problems,  all  approximating  the 
solution  of  the  PDE.  We  will  refer  to  a  given  set  of  specifications  as  a  problem  instance. 
The  size  of  a  problem  instance  is  the  minimum  serial  complexity  of  algorithms  that  solve 
the  problem  instance.  We  will  use  the  term  scaling  up  to  mean  changing  the  problem 
specifications  in  such  a  way  that  its  size  increases. 

We  say  that  a  multiprocessor  architecture  scales  for  a  problem  if  the  minimum  parallel 
cost  can  be  bounded  independent  of  the  problem  size  by  scaling  up  the  architecture.  That 
is,  there  exists  a  finite  constant  associated  with  the  problem  and  the  architecture  with  the 
following  property.  For  any  given  problem  instance,  there  is  always  some  instance  of  the 
architecture  that  is  large  enough  that  the  minimum  parallel  cost  is  less  than  this  constant. 
Given  a  reasonable  assumption  on  how  the  size  of  a  problem  can  grow,  we  will  show  that 
no  multiprocessor  architecture  scales  for  our  problems. 


processors.  In  general,  we  can  define  a  scaling  of  a  given  multiprocessor  by  replicating  the  original 
multiprocessor  some  number  of  times,  and  joining  these  pieces  together  in  some  consistent  way. 
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3.2  Problem  Scaling  Assumptions 

We  will  not  increase  the  size  of  the  problem  unless  some  advantage  is  gained  by  doing 
so.  On  serial  computers,  there  are  only  two  possible  reasons  for  increasing  the  size  of 
problem.  First,  we  may  be  interested  in  approximating  the  solution  more  accurately, 
jnd  are  willing  to  pay  the  price  of  a  longer  execution  time.  Second,  the  algorithm  we 
are  using  may  not  be  optimal,  and  solving  a  problem  instance  with  a  larger  size  may  take 
less  time  than  when  solving  the  original  instance.  But,  as  the  size  continues  increasing, 
the  increasing  minimum  serial  complexity  will  force  the  serial  complexity  of  all  algorithms 
to  eventually  increase.  Thus,  ignoring  the  vagaries  of  individual  algorithms,  the  most 
common  reason  for  increasing  the  size  of  the  problem  is  to  improve  the  approximation  to 
the  solution. 

On  multiprocessors,  there  is  one  additional  reason  to  increase  the  size  of  the  problem. 
It  may  be  that  a  problem  instance  with  a  larger  size  can  be  solved  with  a  smaller  parallel 
cost  than  the  original  probem  instance  could  be.  This  may  be  due  to  the  vagaries  of  a 
given  parallel  algorithm,  or  to  a  smaller  minimum  parallel  cost.  But,  for  a  fixed  error  bound 
e,  there  is  a  lower  bound  on  the  minimum  parallel  cost  that  is  independent  of  the  other 
factors,  as  was  described  in  chapters  1  and  2.  Therefore,  as  the  problem  continues  to  scale 
up,  any  decrease  in  the  parallel  cost  will  be  limited  by  the  lower  bound,  and  any  advantage 
to  scaling  up  the  problem  will  suffer  from  a  law  of  diminishing  returns.  Additionally,  a 
decrease  in  the  minimum  parallel  cost  is  unlikely  when  scaling  up  the  problem,  and  would 
be  difficult  to  predict.  In  practice,  when  a  problem  is  scaled  up,  it  is  done  in  order  to 
improve  the  approximation  to  the  solution. 

Given  the  above  discussion,  we  will  assume  that  we  are  only  interested  in  scaling  up 
a  problem  if  the  approximation  to  the  solution  is  thereby  improved.  Our  first  assumption 
specifies  how  we  will  measure  this  improvement.  This  assumption  includes  the  most 
commonly  used  techniques. 

I)  If  the  problem  is  scaled  up,  then  some  upper  bound  on  the  error  in  approximating 
the  solution  of  the  PDE  is  decreased.  We  will  assume  two  different  types  of  error 
bounds. 


a)  Let  Z  be  a  finite  set  of  locations  in  Q,  and  assume  that  the  error  bound  to 


66 


CHAPTER  3.  SCALING 


be  satisfied  is  of  the  form 

max  max  |u(z.)  —  u(z,)\  <  e  . 

u€U  Zj£Z 

For  each  Zj  G  Z  and  u  6  U ,  u(zj)  is  an  approximation  to  u{zf)  calculated 
from  the  data  specified  by  the  problem  instance. 

b)  Assume  that  the  error  bound  is  on  how  well  the  solution  function  is  ap¬ 
proximated, 

max  max  \u(z)  —  u{z)\  <  e  . 

uG U  zGO 

For  each  u  £  U ,  the  function  u  is  an  approximation  to  u  satisfying  the 
following  property.  For  each  problem  instance,  u(z)  is  a  function  of  a 
finite  set  of  values  {u(-2_,)},  where  { Zj }  is  the  ret  cf  locations  specified  by 
the  instance  and  each  u(z})  is  an  approximation  to  u(zj )  calculated  from 
the  available  data.  Furthermore,  for  each  z  6  fi,  u(z)  is  a  function  of  at 
most  c  of  these  approximate  solution  values,  where  c  is  independent  of  the 
size  of  the  problem  and  z. 

For  case  b,  u  might  be  a  piecewise  polynomial  representation  of  the  function  based  on  the 
set  of  values  specified  by  the  problem  instance.  The  assumption  about  how  ii(z)  is  defined 
is  to  ensure  that  approximating  the  solution  at  the  locations  specified  by  the  problem 
instance  represents  the  major  component  of  the  cost  of  approximating  the  function. 

We  also  want  the  increase  in  the  size  to  be  effective  in  decreasing  the  error.  This 
motivates  the  second  assumption  about  how  the  problem  scales. 

2)  There  exists  a  monotonically  increasing  continuous  function  r\  with  the  following 
properties. 

a)  T](x)  is  finite  and  positive  for  finite  x  >  0,  and  tj( 0)  =  0. 

b)  Let  s(p)  be  the  size  of  a  problem  instance  p.  Let  ep  be  the  error  bound 
satisfied  by  the  instance.  Then, 
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This  insures  that,  in  the  limit  as  the  size  of  the  problem  goes  to  infinity,  the  error  in  the 
approximation  goes  to  zero. 

Unless  the  solution  functions  in  U  are  at  least  continuous,  or  have  some  special  struc¬ 
ture,  the  error  bound  in  case  b  of  assumption  1  cannot  be  made  arbitrarily  small.  Thus, 
by  assumption  2,  the  size  of  the  problems  we  are  interested  in  solving  will  be  bounded 
from  above.  If  we  assume  that  all  functions  in  U  are  at  least  continuous,  then  this  upper 
bound  no  longer  exists.  We  prove  this  in  the  following  lemma. 

Lemma  3.1  Assume  that  all  u  6  U  are  continuous  on  the  compact  domain  Pi.  Then, 
for  each  t  >  0,  there  exists  a  problem  instance  of  finite  size  whose  solution  approxi¬ 
mates  the  solution  of  the  PDE  to  within  this  error  tolerance  for  all  u  €  U . 

Proof:  Assume  that  u  €  U .  Since  PI  is  compact,  u  is  uniformly  continuous  on  this 
domain.  Therefore,  there  exists  a  8  such  that 

|u(2)  -  u{z')\  <  e/2 

for  all  z,z'  e  PI  for  which  \\z  —  z'\\i  <  8.  And,  since  Pi  is  compact,  there  exists  a  finite 
set  of  locations  Zt  C  Pi  such  that 

Q  C  U  B&;6)  ■  (3.1) 

Zj  £Ze 

Therefore,  only  a  finite  number  of  solution  values  are  required  to  approximate  the  solution 
anywhere  in  the  domain  to  within  an  error  tolerance  of  e/2. 

By  corollary  2.10,  we  know  that  there  exists  a  finite  complexity  serial  algorithm  a 
that  approximates  the  solution  at  each  location  in  Ze  to  within  an  error  tolerance  of  e/2. 
Denote  the  approximation  to  u(zj)  by  ua(zj).  Thus,  for  any  2  E  0,  there  exists  some 
Zj  €  Zt  such  that 

\u(z)  -  Ua(zj)\  <  \u(z)  -  u(Zj)\  -(-  !u(Zj)  -  ua{z})\ 

<  e/2  +  e/2  . 

This  follows  from  equation  3.1.  Therefore,  there  exists  an  algorithm  with  a  finite  serial 
complexity  whose  solution  approximates  the  solution  of  the  PDE  to  within  an  error  bound 
of  e.  This  proves  the  Lemma  for  both  cases  a  and  b  of  assumption  1.  I 
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The  complexity  of  the  algorithm  used  to  prove  corollary  2.10  was  bounded  by  making 
some  simplifying  assumptions.  For  each  z3  G  Zt<  the  algorithm  a  has  the  form 

J  •£va,x  +  pt)  dt  m*' 

ro,i,/,fc('2j)  •  gi{Xijjt)  . 

»=1  1=1  k=l 

We  assumed  that  the  cost  of  calculating  the  coefficients  {raiji(iJk(£j)}  is  zero.  But 
Lemma  3.1  will  hold  as  long  as  the  cost  of  approximating  each  coefficient  to  within 
an  error  tolerance  of  e  is  finite  for  each  e  >  0.  In  general,  any  of  the  traditional  finite  dif¬ 
ference  and  finite  element  techniques  can  be  used  to  generate  finite  complexity  algorithms 
that  satisfy  the  same  error  criterion. 


3.3  Architecture  Independent  Bounds 

The  next  theorem  essentially  says  the  following.  If  the  problem  cannot  be  solved  exactly 
using  a  finite  amount  of  information,  then  the  parallel  complexity  must  increase  as  the 
problem  size  grows. 

Theorem  3.1  Assume  that  there  exists  a  z,  G  Q  and  some  i  G  {1 such  that 
d ,  ^  0  and  the  following  property  holds:  There  exists  a  closed  ball  B{x,\6)  C  Id, 
on  which  is  nonzero  and  continuous,  and  on  which  A/,(x)  is  nonzero.  Also 

assume  that  /(+j  is  bounded  away  from  zero  for  all  permissible  multiprocessors  archi¬ 
tectures.  Then  the  following  conditions  hold  for  case  b  of  assumption  l,  and  for  case 
a  when  z .  G  Z . 

•  As  the  problem  size  increases,  a  lower  bound  on  the  parallel  complexity  increases. 

•  In  the  limit  as  the  problem  size  becomes  infinite,  the  lower  bound  on  the  parallel 
complexity  also  becomes  infinite. 


Proof:  By  assumptions  1  and  2  on  scaling,  as  the  size  of  the  problem  grows,  the  error  in 
approximating  the  solution  decreases.  And  the  error  bound  goes  to  zero  as  the  problem 
size  becomes  infinite. 
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For  case  a  of  assumption  1,  the  solution  is  approximated  at  some  fixed  set  of  solution 
locations  Z.  The  error  bound  is  on  the  error  in  these  approximations.  If  z.  G  Z,  then 
Theorem  2.6  says  that  the  parallel  complexity  must  at  least  grow  like 

/(+)  •  (C1  +  C2  '  log2  Q)  ) 

for  constants  cx  and  c2.  As  t  decreases,  this  bound  grows.  And,  as  e  — +  0,  this  bound 
becomes  infinite. 

For  case  6,  the  error  in  approximating  the  entire  solution  function  is  being  bounded. 
Assume  that  the  error  bound  for  a  specific  problem  instance  is  e.  Then 


|u(z„)  -  u(z„)|  <  t  . 


A  problem  instance  specifies  that  a  given  finite  number  of  solution  values  be  approximated, 
and  these  are  used  to  define  the  function  u(z).  Let  c  be  an  upper  bound  on  the  number 
of  solution  values  used  to  form  the  approximation  u(z,).  By  assumption,  c  is  a  constant 
independent  of  the  problem  size  and  e.  Let  Z.  =  {z>,.}  be  the  set  of  solution  locations 
used  when  approximating  u(zm).  Let  N(zj ,.)  be  the  amount  of  data  used  to  calculate 
u(zj^).  Then  u(z.)  is  a  function  of  at  least 


data. 


By  corollary  2.4,  there  exists  a  nonzero  constant  C  such  that  C  ■  is  a  lower 

bound  on  the  amount  of  data  required  to  approximate  u(z.)  to  within  an  error  tolerance 
of  e.  Therefore, 

£  N(zh.)  >  C-e~d'/m'  , 


and 


I- Gr 

for  at  least  one  zh.  G  Z„.  By  Lemma  1.4,  the  parallel  complexity  of  any  algorithm 
satisfying  this  error  bound  on  the  approximation  to  the  solution  will  be  bounded  from 
below  by 

/(+)  •  ^  •  log2  +  log2  C  -  log2  cj  . 
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As  before,  this  bound  grows  as  e  decreases,  and  the  bound  becomes  infinite  as  e  goes  to 
zero.  This  proves  the  Theorem.  I 

Since  the  parallel  complexity  is  a  lower  bound  on  the  parallel  cost,  the  following 
corollary  is  an  immediate  consequence  of  Theorem  3.1. 

Corollary  3.2  When  the  assumptions  of  Theorem  3.1  are  satisfied,  the  parallel  cost 
of  approximating  a  PDE  increases  unboundedly  as  the  problem  size  increases  unbound¬ 
edly,  independent  of  the  size  of  the  multiprocessor  architecture. 


3.4  Effect  of  Communication  Costs 

3.4.1  Computation  Bound  Algorithms 

By  corollary  3.2,  we  usually  can’t  bound  the  parallel  cost  as  the  size  of  the  problem 
increases.  But  a  good  mult*processor  architecture  won’t  exacerbate  this  increase.  In  this 
section  we  examine  how  the  communication  capabilities  of  a  multiprocessor  affect  the 
parallel  cost  as  the  size  of  both  the  architecture  and  the  problem  increase. 

Using  the  notation  of  Section  1.2.1,  the  parallel  cost  Tp  of  a  parallel  algorithm  is 
bounded  by 

max  {CP,WV}  <  Tp  <  Cp  +  Wp  . 

Cp  is  the  parallel  complexity,  and  Wp  is  the  communication  cost.  If  the  algorithm  is 
computation  bound,  then  <  Cp,  and  the  parallel  cost  is  at  most  twice  as  much  as  it 
would  be  on  a  multiprocessor  with  arbitrarily  powerful  communication  capabilities.  If  the 
communication  and  the  computation  overlap,  then  the  parallel  cost  can  be  as  low  as  Cp. 

As  in  Section  1.2.2,  let  Pa<m  be  the  maximum  number  of  processors  that  collaborate 
in  the  calculation  of  a  single  solution  value  for  a  parallel  algorithm  a.  For  a  given  multi¬ 
processor,  let  r(p)  be  the  radius  of  the  subset  of  p  processors  with  minimum  radius.  Note 
that  r(p)  is  a  nondecreasing  function  of  p.  By  Lemma  1.3,  r(Pa ,)  is  a  lower  bound  on  Wp 
for  algorithm  a.  For  the  rest  of  this  section,  we  will  describe  conditions  on  r(p)  that  de¬ 
termine  whether  parallel  algorithms  on  a  multiprocessor  architecture  can  be  computation 
bound. 
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3.4.2  Sufficient  Conditions  on  r(p) 

Let  Na<.  be  the  maximum  amount  of  data  used  by  algorithm  a  when  calculating  a  single 
solution  value.  Assume  that  we  have  an  upper  bound  on  the  serial  complexity  of  computing 
a  single  solution  value  of  the  form2 

/(+)  •  Ca(Na,.)  . 

Assume  further  that  this  representation  of  the  upper  bound  as  a  function  of  iVa>.  is 
common  to  some  set  of  serial  algorithms. 

Example:  To  calculate  the  solution  using  an  implicit  linear  method,  we 
solve  a  matrix  equation  Ax  =  b.  A  is  a  nonsingular  N  x  N  matrix,  where 
N  <  Na<m.  A  simple  upper  bound  on  the  serial  complexity  of  solving  this 
linear  system  using  Gaussian  elimination  is 

(/(+)  +  /(.)  +  /(/))  *  W3  • 

See  Atkinson  [Atk78,  pages  441-443].  Using  this, 

C3(Na ,.)  <  (l  +  /(,/^')~L/))  ’  • 

In  contrast,  the  explicit  linear  algorithm  described  in  Section  2.3.2  has  an 
upper  bound  of 

(/(+)  +  /(•))  •  Na<m 

on  the  serial  complexity  of  computing  a  single  solution  value.  For  this  algo¬ 
rithm, 

C,(Na,.)  <  (l  +  j^J  ■  Na,.  . 

Until  now  we  have  concentrated  on  the  cost  of  computing  binary  floating  point  op¬ 
erations,  since  they  determine  how  the  data  dependence  efFects  the  complexity.  For  this 
analysis,  we  must  also  take  into  account  the  one  unary  operator  we  have  assumed,  nega¬ 
tion.  It  can  be  included  in  our  analysis  easily  enough  given  one  simple  assumption.  We 

2Note  the  change  in  notation.  In  Section  1.2.2,  Ca(Na,.)  referred  to  a  lower  bound  on  the  serial 
complexity. 
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assume  that  no  algorithm  in  the  given  set  calculates  unnecessary  unary  negations,  i.e., 
that  all  expressions  of  the  form  —(  —  (/))  have  been  reduced  to  /.  Calculating  unneces¬ 
sary  unary  negations  can  only  add  to  both  the  serial  and  parallel  complexities,  and  has  no 
redeeming  properties.  Given  this  assumption,  the  following  lemma  holds. 

Lemma  3.2  All  parallel  implementations  of  algorithms  in  the  given  set  satisfy 

Pa.-  <3-Cs(JVa,.)  +  1  . 

Proof:  The  expression  /(+)  •  Cs(Na:.)  is  an  upper  bound  on  the  serial  complexity  of 
using  algorithm  a  to  compute  a  single  solution  value.  Since  floating  point  addition  is  the 
cheapest  binary  floating  point  operation,  at  most  Cs(7Va,„)  binary  floating  point  operations 
are  required  to  calculate  a  single  solution  value.  Therefore,  there  are  at  most  Cs(Aa,„) 
intermediate  results  that  might  be  negated.  The  iVa,  data  may  also  be  negated  in  the 
algorithm.  By  Lemma  1.2,  Na <  C,(iVa,.)  +  1.  Therefore,  at  most 


Ca(Na,.)  +  Na.m  <  2-C,(Na,.)  +  l 


unary  negations  will  be  calculated,  and  at  most  3  •  C3{Nai.)  +  1  floating  point  operations 
are  required  by  the  algorithm. 

If  a  processor  is  collaborating  in  the  computation  of  a  particular  solution  value,  then 
it  will  compute  a  floating  point  operation  whose  result  is  required  by  that  value.  Since 
there  are  no  more  than  3  ■  C,{Na_.)  +  1  of  these  values,  there  can  be  no  more  than  this 
many  collaborating  processors.  I 

Let  P  be  the  number  of  processors  in  a  given  multiprocessor.  By  Lemma  1.4, 


Cp  >  /(+)  •  max 


Cs(Na,.) 

Pa.. 


,  fl°g2  Na,.} 


Therefore,  if 


for  all 


r(p)  <  /(+)  •  max  \  ^  .  log2  X,,. 


(3.2) 


pe  {1 . min{3-r,(A',,.)+  1  .  P)} 
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then  this  aspect  of  the  communication  cost  will  not  prevent  parallel  implementations  of 
algorithms  in  the  set  from  being  computation  bound.  This  follows  from  Lemmas  1.4 
and  3.2.  Since  r(p )  is  a  nondecreasing  function  of  p,  and  since  the  right  hand  side  of 
inequality  3.2  is  a  nonincreasing  function  of  p,  it  is  sufficient  that  r(p)  satisfy 

r(min{3C,(^Va,.)  +  l,P})  <  f{+)  •  log2  Na,.  .  (3.3) 

Theorem  3.3  Assume  that  C,(Nat,)  <  a  ■  Nf „  for  some  set  of  serial  algorithms, 
where  oo  >  7  >  1.  Assume  that  no  algorithm  in  this  set  calculates  unnecessary  unary 
negations.  Assume  that 

r(p)  <  0-  pog2pl 

on  a  given  multiprocessor,  where 

0  <  - ^ -  • 

2  +  log2(3a)  +  7 

Then  r(p)  will  not  prevent  any  parallel  implementation  of  algorithms  in  this  set  from 
being  computation  bound  on  this  multiprocessor. 

Proof:  Again,  let  P  be  the  number  of  processors  in  the  multiprocessor.  Assume  that 
r(p)  =  0\og2p,  and  that  3  •  C,(Nat.)  +  1  <  P.  Then 

r(3-CJ(iVa>.)  +  l)  <  r(3a  •  Nf.  +  1 ) 

=  /3.[logl(3a.,V;.+  l)| 

<  0  ■  (log2(3a  •  ,\;j  +  2) 

=  0  ■  (7  '  Iog2  ^a.*  +  !og2(3a)  +  2)  . 

If  iVa .  =  1,  then  Pa .  =  1  and  r{Pa  . )  =  0  for  any  0.  Assume  that  Na >  2.  Therefore, 
Iog2  i'V0i.  >  1.  By  inequality  3.3,  0  must  satisfy 

0  '  (7  •  log2  Na,.  +  log2(3o)  +  2)  </(  +  )-  log2  ,Va,.  . 

This  is  equivalent  to 

<  is*** .(hti.  . 

7  \  0  I 
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and  a  sufficient  condition  is 

3  <  - -  . 

2  +  log2(3a)  +  7 

If  P  <  3  •  C.{Na,.)  +  1,  then 

•  log2  P  <  j3  ■  log2  (3  ■  C3(Na>m)  +  1 )  . 

Therefore,  r(P)  <  /(+)  •  log2  Na<.  for  the  same  value  of  ft.  I 

This  condition  does  not  imply  that  parallel  implementations  of  all  such  serial  algo¬ 
rithms  will  be  computation  bound.  It  does  not  even  indicate  that  there  exist  any  parallel 
implementations  that  are  computation  bound.  It  simply  says  that  the  fact  that  p  proces¬ 
sors  are  collaborating  in  a  computation  does  not  prevent  a  parallel  implementation  from 
being  computation  bound. 

Note  that  a  polynomial  type  bound  on  Ca(Nai.)  will  exist  for  all  reasonable  parallel  al¬ 
gorithms,  independent  of  the  problem  size.  Most  traditional  methods  generate  algorithms 
for  which  C,{Na ,„)  is  less  than  an  expression  of  the  form  a  ■  N%<m  for  all  problem  sizes. 
And,  by  corollary  2.10,  we  can  get  within  a  constant  factor  of  the  lower  bound  on  the 
parallel  complexity  by  using  parallel  algorithms  for  which  a  is  finite  and  7=1. 

Corollary  3.4  For  fixed  a  and  7,  there  is  a  finite  transmission  time  t  dependent  only 
on  a  and  7  such  that  the  following  property  holds:  If  a  multiprocessor  is  a  member  of 
one  of  the  following  architectural  families, 

•  Fully  Connected 

•  Star 

•  Distributed  Shared  Memory 

•  Hypercube 

and  if  each  communication  channel  has  a  transmission  time  of  t,  then  r(p)  satisfies 
the  condition  in  Theorem  3.1.  In  these  cases ,  r(p)  grows  slowly  enough  that  no  set  of 
parallel  algorithms  satisfying  ( ',( ,'Va  „ )  <  a  ■  Nf .  is  prevented  from  being  computation 
bound  by  the  mere  fact  that  p  processors  are  collaborating  in  a  computation .  This  is 
independent  of  the  multiproce ssor  and  problem  sizes. 
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Proof:  For  all' of  these  architectural  families,  r(p )  <  t  •  log2p.  See  Section  1.1.3  for  the 
discussion  of  this  fact.  Therefore,  r(p)  is  sufficiently  small  if  t  <  0,  where  (3  is  given  in 
Theorem  3.3.  I 

3.4.3  Necessary  Conditions  on  r(p ) 

Consider  a  multiprocessor  with  P  processors  and  a  given  problem  instance.  Let  A(p)  be 
the  set  of  all  parallel  algorithms  that  solve  the  problem  instance  on  this  multiprocessor, 
and  that  use  no  more  than  p  processors  to  calculate  a  single  solution  value.  That  is,  if 
a  €  A(p),  then  Pa%.  <  p.  For  an  algorithm  a,  let  Cp(a )  be  the  parallel  complexity  of 
algorithm  a.  For  each  p.  let 

CP(P)  =  min  Cp(a)  . 

ae-4(p) 

Since  A(px)  C  A(p2)  when  pi  <  p2 ,  Cp(pi)  >  Cp(p2).  Therefore,  Cp{P )  is  the  minimum 
parallel  complexity  for  algorithms  that  solve  this  problem  instance  on  this  multiprocessor. 

Lemma  3.3  A  lower  bound  on  the  parallel  cost  in  solving  the  problem  instance  on 
this  multiprocessor  is 

J^nmmax{r(p)’  ^p(P)}  • 

P€{  1 . P} 


Proof:  Assume  that 

min  max  {r(p) ,  Cp(p)} 
pe{i . p'} 

is  a  lower  bound  on  the  parallel  cost  for  all  algorithms  in  A(p'),  but  assume  that 


(3.4) 


min  max  (r(p) ,  Cp(p)}  (3.5) 

P€{1 . p'+l} 

is  not  a  lower  bound  on  the  parallel  cost  for  all  algorithms  in  A(p'  +1).  In  particular,  let 
a  be  some  algorithm  in  A{p'  4-  1)  whose  parallel  cost  is  less  than  expression  3.5. 

By  Lemma  1.3,  max  {r(p'  +  1) ,  Cv{p'  +  1)}  is  a  lower  bound  on  the  parallel  cost  of  all 
algorithms  a  €  A{P)  for  which  ,Va .  =  p'+  1.  Therefore,  a  6  A(p').  But,  by  assumption, 
the  parallel  cost  of  a  is  no  smaller  than  expression  3.4,  which  is  itself  no  smaller  than 
expression  3.5.  This  is  a  contradiction  Therefore,  expression  3  5  is  a  lower  bound  on  the 
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parallel  complexity  of  all  algorithms  in  A(p'  +  1).  Since  r(l)  =  0,  expression  3.4  is  always 
a  lower  bound  when  p'  =  1.  The  Lemma  then  follows  from  an  induction  argument  on  p'. 

I 

By  the  results  in  the  previous  section,  we  know  that  a  logarithmic  bound  on  r(p) 
is  usually  sufficient  to  permit  computation  bound  parallel  algorithms  to  exist  when  p 
processors  are  collaborating.  For  now,  assume  that  r(p)  >  /3  •  pM  for  this  multiprocessor. 
For  example,  if  the  multiprocessor  is  a  d  dimensional  array,  then  r(p)  >  (dtj 2)  •  (p1^  —  1). 
If  0  =  dt/ 4  and  p  =  1/d,  then  r(p)  >  f3  ■  p*  when  p 4  >  2. 

Let  e  be  the  error  tolerance  specified  by  the  problem  instance.  Let  Z  be  the  set  of 
solution  values  specified  by  the  problem  instance.  Let 

I 

Ne,m  =  max  ^2  Ntii(z)  . 

,=i 

Lemma  3.4  If  r(p)  >  (3  ■  pM,  then  a  lower  bound  on  the  parallel  cost  in  solving  the 
problem  instance  on  this  multiprocessor  is 

0*  •(/<+)•  w,.  - 1))*  . 


Proof:  By  definition,  N(<m  is  a  lower  bound  on  Na<m  for  all  a  6  A(P).  Thus,  by  Lemma  1.4, 


/(+) 


N(,.  -  1 


is  a  lower  bound  on  Cp(p)  for  all  p.  If  r(p)  >  0  ■  pM,  then  the  bound  in  Lemma  3  3  is 
itself  bounded  from  below  by 

1 


min  max  1/3  ■  pM,  /(+) 

P€{1 . oo}  f 


Nt,m- 

P 


One  term  in  this  expression  is  monotonically  increasing  as  a  function  of  p,  and  one  is 
monotonically  decreasing.  The  minimum  occurs  at  the  value  of  p  for  which  the  two  terms 
are  equal, 

p=  (^ •  w.. -  l ) 
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Setting  p  equal  to  this  value  in  either  of  the  two  terms  gives  the  lower  bound  in  the 
statement  of  the  Lemma.  I 

If  P  is  large  enough,  then  the  Green’s  function  method  described  in  Section  2.3.2  can 
be  used  to  describe  an  upper  bound  on  Cp(p )  for  all  p. 

Lemma  3.5  Let  Ngfm>m  be  the  maximum,  amount  of  data  used  by  the  algorithm  de¬ 
scribed  in  Section  2.3.2  when  calrulating  a  single  solution  value.  If  P  >  p  ■  \Z\,  then 

CM  <  (/<.)  +  /(♦))•  +  /w-IWl 

for  p  <  Ngfm  .. 

Proof:  For  each  solution  value,  the  Green's  function  method  can  be  represented  by  the 


expression 


where 


5Z  Yi  X)  ri.iAzj)  ■  9i(*i,k)  , 

1=1  1=1  k=l 


The  multiplications  ar^  all  independent,  and  p  processors  can  execute  them  in  time  less 
than  or  equal  to 

r  ,o 


once  each  processor  has  received  the  relevant  data.  The  next  stage  is  the  summation 
of  these  products.  Partition  the  summands  into  p  subsets  of  approximately  equal  size, 
and  distribute  them  among  the  processors.  Summing  the  elements  of  each  subset  is 
independent,  and  each  processor  can  calculate  the  sum  of  its  subset  in  time  no  greater 
than 

f  ,o  -v 


This  leaves  at  most  p  partial  results  to  be  summed.  As  in  the  example  on  page  10,  this 
final  summation  can  be  scheduled  so  that  the  parallel  complexity  is  at  most 


/(+)  •  iP 


CI.S) 
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If  at  most  p  processors  collaborate  in  the  calculation  of  a  single  solution  value,  then 
this  algorithm  is  a  member  of  A(p).  Thus,  the  sum  of  expressions  3.6,  3.7,  and  3.8  is  an 
upper  bound  on  Cp(p)  for  each  p.  Note  that  at  most  Ngfm,  floating  point  operations 
can  be  calculated  in  parallel  during  the  computation  of  a  solution  value.  Therefore,  this 
upper  bound  is  only  interesting  for  p  —  Ngfm„.  I 

We  will  refer  to  the  parallel  algorithms  in  A(P)  that  have  the  minimum  parallel  cost 
as  minimum  cost  parallel  algorithms.  Note  that  the  optimal  parallel  algorithms  in  Chap¬ 
ter  2  were  minimum  cost  parallel  algorithms  for  the  ideal  multiprocessor  described  there. 
Lemma  3.4  describes  a  lower  bound  on  the  parallel  cost.  Lemma  3.5  describes  an  upper 
bound  on  Cp(P)  when  P  is  large  enough.  If  we  parallelize  the  Green's  function  method 
by  equidistributing  the  processors  among  the  different  solution  value  computations,  then 
at  least  [-P/I^U  processors  will  be  collaborating.  Let  P'  =  \_P/\Z\\.  By  Lemmas  3.4 
and  3.5,  if 


/?*■(/(«•  w,- 1))* 


>  2 '  «<&•)  I  <*•> +  M 


N, 


gfm,* 


+  /(+)  •  n°g2p]  j  » (3-9) 


then  the  minimum  parallel  cost  is  more  than  twice  the  minimum  parallel  complexity.  In 
this  case,  either  the  minimum  cost  parallel  algorithm  is  not  computation  bound,  or  the 
minimum  cost  parallel  algorithm  does  not  have  the  minimum  parallel  complexity.  In  both 
situations,  the  communication  cost  has  become  the  dominant  factor  in  determii.ing  the 
behavior  of  minimum  cost  parallel  algorithms. 


Theorem  3.5  A  ssume  that  there  exists  a  z.  £  fi  and  some  s  £  {1,.  .  . such  that 
d,  ^  0  and  the  following  property  holds:  There  exists  a  closed  ball  B(x.\8)  C  Id,  on 
which  is  nonzero  and  continuous,  and  on  which  Ms(x)  is  nonzero.  Assume 

that  0  and  p  are  positive  constants,  and  that  r(p)  >  0  ■  pM  for  all  scalings  of  a  given 
multiprocessor.  Then  the  following  conditions  hold  for  case  b  of  assumption  1,  and 
for  case  a  when  z.  £  Z . 

1)  There  exists  a  problem  instance  and  a  multiprocessor  instance  such  that  the 
minimum  parallrl  cost  is  greater  than  twice  the  minimum  parallel  complexity. 
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This  condition  holds  for  all  scalings  of  the  multiprocessor  larger  than  the  given 
instance. 

2)  Let  R(p)  be  the  ratio  of  the  minimum  parallel  cost  to  the  minimum  parallel 
complexity  on  a  multiprocessor  instance  of  size  p.  In  the  limit  as  the  size  of  the 
problem  goes  to  infinity,  maxp>i  R(p)  also  goes  to  infinity. 


Proof:  Consider  case  a  of  assumption  1  first,  and  assume  that  z.  £  Z .  By  Theorem  2.2, 
there  exists  a  constant  C!  such  that 

N(,.  >  N(ti(zm)  >  ci  ■  . 

By  Lemma  3.4,  a  lower  bound  on  the  minimum  parallel  cost  is 

0^  •  (j+  •  (ci  •  -  1))M+1  • 

This  is  bounded  from  below  by 


when  Cie~m>  >  2. 

Let  Ngfmi{z)  be  the  amount  of  zth  component  data  used  by  the  Green’s  function 
method  from  Section  2.3.2  to  approximate  u(z).  For  this  algorithm,  the  error  in  approxi¬ 
mating  the  zth  component  is  less  than  c/(/  +  1).  By  Theorem  2.3, 


f  l  { 

Ngfm,.  <  max  ]T  Ngfnht(z)  <  max£  (  cw(i)  ■  (-)  m' 


for  functions  C2,,(c)  that  are  bounded  on  0.  Therefore,  there  exists  a  finite  constant  C2 
such  that 


when  t  <  1.  If  P  =  \Z\  ■  Ngfnh,,  then 


CP(P)  —  Cp(Ngfm,.)  —  (/(*)  +  /(+))  T  /(+)  ’  1°S2(  •'''gfm,*) 
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This  follows  from  Lemma  3.5.  Therefore, 


Cp{P)  <  (/(.,  +  /(+))  +  /(+)•(  log2  c2  +  max  — -log2e  1  ]  . 

The  lower  bound  on  the  parallel  cost  grows  linearly  in  as  e  — ►  0,  while  the 

upper  bound  on  CP(\Z\  •  Ngfm .)  only  grows  logarithmically  in  c"1.  Thus,  there  exists 
some  e.  such  that  the  parallel  cost  is  more  than  twice  CP(\Z\  ■  Ngfm„)  when  e  <  e„. 
As  e  — ♦  0,  the  ratio  of  the  minimum  parallel  cost  to  the  minimum  parallel  complexity  is 
bounded  from  below  by 

d. 


£  m, '^+1 

c3  '  _  1 

log2e 

for  some  positive  constant  c3.  This  can  be  made  arbitrarily  large  for  small  enough  e.  This 
has  two  implications. 

First,  for  small  e,  the  minimum  parallel  cost  will  be  larger  than  twice  the  minimum 
parallel  complexity  even  when  fewer  than  \Z\-Ngfm<m  processors  are  available.  For  example, 
the  upper  bound  on  Cp{Ngfm  ,/ 2)  is  at  most  double  the  bound  on  Cp(Ngfm  „)  for  a  fixed 
e.  In  general,  for  fixed  e  <  e„,  this  analysis  can  be  used  to  calculate  a  P(  such  that  the 
following  property  holds:  If  the  size  of  the  multiprocessor  instance  is  greater  than  P(,  then 
the  minimum  parallel  cost  is  more  than  twice  the  minimum  parallel  complexity. 

Second,  let  R(p )  be  the  ratio  of  the  minimum  parallel  cost  to  the  minimum  parallel 
complexity  on  a  multiprocessor  instance  of  size  p.  Since 


maxfl(p)  >  R{\Z\  ■  Ngfa'.)  , 

we  know  that  maxp>i  R{p)  — >  oo  when  e  — >  0. 

This  proves  the  Theorem  for  case  a.  Case  b  is  exactly  the  same,  except  that  the  lower 
bound  on  Nt<,  is  now  a  factor  of  1/c  smaller.  This  bound  is  derived  in  exactly  the  same 
way  as  in  Theorem  3.1  I 

In  summary,  if  r(p)  >  (3  ■  p1*  for  any  positive  and  p,  then  the  communication  costs 
will  eventually  affect  the  achievable  parallel  performance.  This  condition  holds  for  some 
common  architectures. 

Corollary  3.6  Assume  that  the  graph  of  a  multiprocessor  is  a  d  dimensional  array, 
and  that  the  transmission  time  t  is  bounded  away  from  zero.  Assume  that  the  PDE 
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satisfies  the  conditions  in  Theorem  3.5.  Then,  for  large  problem  sizes,  the  commu¬ 
nication  cost  determines  the  minimum  parallel  cost  as  the  multiprocessor  is  scaled 
up. 

Proof:  As  before,  r(p)  >  ( dt/2 )  •  (p1^  —  1)  for  a  d  dimensional  array.  Therefore,  if 
pi/d  >  2,  then  r(p)  >  (dt/ty-p1^.  The  result  then  follows  immediately  from  Theorem  3.5. 

I 

As  we  mentioned  in  Chapter  1,  the  communication  capabilities  of  large  multiprocessors 
are  constrained  by  the  three  dimensionality  of  the  physical  world,  and  by  the  speed  of  light. 
All  multiprocessors  can  be  modelled  by  a  3-dimensional  array  of  processors  with  some  finite 
transmission  speed.  This  motivates  the  next  corollary. 

Corollary  3.7  Assume  that  each  processor  is  a  cube  with  a  fixed  finite  volume.  As¬ 
sume  that  the  PDE  satisfies  the  conditions  in  Theorem  3.5.  Then,  for  large  problem 
sizes,  the  communication  cost  determines  the  minimum  parallel  cost  as  the  multipro¬ 
cessor  is  scaled  up. 

Proof:  Assume  that  the  volume  of  a  processor  is  v  in  some  standard  unit.  Then  a  set 
of  p  processors  will  take  up  a  volume  of  at  least  pv,  and  cover  a  region  whose  maximum 
width  is  at  least  (pu)1/3.  Any  message  between  processors  must  travel  from  a  surface  of 
the  sending  processor  to  a  surface  of  the  receiving  processor.  Let  the  physical  diameter 
of  the  set  be  the  maximum  distance  between  two  surfaces  of  processors  in  the  set.  Then 
the  physical  diameter  is  greater  than 

{pv)1'3  -2v1'3  =  (p-  8),/3V'3  . 

As  in  Section  1.1.2,  let  the  center  processor  be  the  one  that  minimizes  the  maximum 
distance  between  itself  and  all  of  the  others,  where  distance  is  now  measured  between 
closest  surfaces.  Let  the  physical  radius  be  the  maximum  distance  between  the  center 
and  the  other  processors.  Then  half  the  physical  diameter,  minus  half  the  width  of  the 
center  processor,  is  a  lower  bound  on  the  physical  radius,  i.e. 

fp  —  8)1/3  •  —  v1/3 

•> 
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Since  no  message  can  travel  faster  than  the  speed  of  light,  the  radius  of  this  set  of 
processors  is  at  least 

((p-8)1/3— l)  V'3 

2c  ’ 

where  c  is  the  speed  of  light  in  these  units.  If  0  =  u1/,3/4c  and  p  >  64,  then  r(p)  > 
0  ■  plC,  The  result  then  follows  from  Theorem  3.5.  I 

Using  current  technologies,  the  speed  of  light  is  not  the  only  restriction  on  transmission 
speed,  and  r(p)  >  0  ■  p1/3  for  0  that  is  much  larger  than  that  calculated  in  this  corollary. 

3.4.4  Examples 

The  results  of  Section  3.4.3  described  some  effects  of  r(p)  on  the  parallel  cost  of  minimum 
cost  parallel  algorithms.  Similar  results  can  also  be  derived  for  other  algorithms  with  similar 
properties.  Assume  that  the  serial  complexity  of  computing  a  single  solution  value  u{z,) 
using  an  algorithm  a  is 

«•/(+)•*£.  ■ 

Here,  Na<.  is  the  maximum  amount  of  data  used  in  the  calculation  of  a  single  solution 
value.  For  example,  a  might  be  based  on  an  implicit  linear  method  or  a  Green’s  function 
method,  as  in  the  examples  on  page  71.  For  the  Green’s  function  method,  7  =  1.  For  an 
implicit  linear  method,  7  will  normally  be  found  in  the  interval  [1,3]. 

A  lower  bound  on  the  parallel  cost  of  a  parallel  implementation  of  a  is 

max  jr(p) ,  - — - —  ,/(+)•  log2  Na<m  j  (3.10) 

when  p  processors  collaborate  in  approximating  u{z.).  The  first  term  is  a  lower  bound  on 
the  communication  cost,  and  the  second  and  third  terms  are  lower  bounds  on  the  parallel 
complexity.  For  the  ideal  multiprocessor  of  Chapter  2,  r(p)  =  0.  For  a  hypercube  based 
architecture,  r(p)  >  t  ■  log2p.  For  a  d  dimensional  array,  r(p)  >  (dt/ 2)  ■  ( p —  1).  Note 
that  this  lower  bound  on  the  parallel  cost  is  the  same  for  the  ideal  multiprocessor  and  the 
hypercube  when  /  =  /(+). 

Assume  that  t  =  /(+),  a  =  1,  and  Na,.  =  1000.  Figure  3.1  on  page  83  and  3.2  on 
page  84  are  graphs  of  the  lower  bound  on  the  parallel  cost  for  a  variety  of  architectures 
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Hypercube  (—),  1-D  (•  •),  2-D  (•••),  3-D  (- - -) 

Figure  3.1:  Lower  bound  on  parallel  cost  as  a  function  of  the  number  of  collaborating 
processors  when  jVa  ,  =  1000  and  7=1. 

and  two  values  of  7.  Each  figure  is  a  graph  of  the  the  lower  bound  as  a  function  of  p, 
where  the  bound  is  expressed  in  units  of  /(+).  The  solid  curve  represents  a  hypercube 
based  architecture,  the  dashed  curve  represents  a  1-dimensional  array,  the  dot-dash  curve 
represents  a  2-dimensional  array,  and  the  dotted  curve  represents  a  3-dimensional  array. 
Figure  3.1  assumes  that  7  =  1,  while  Figure  3.2  assumes  that  7  =  2.  From  Lemma  3.2, 
we  know  that  the  maximum  number  of  processors  that  can  be  used  to  calculate  u(z .)  is 
bounded  from  above  by  3  •  N^m  +  1.  This  limit  bounds  the  range  of  number  of  processors 
used  in  the  two  graphs. 

In  both  figures  the  component  of  the  lower  bound  representing  the  communication 
cost  is  the  dominant  term  for  the  d.  dimensional  arrays  when  p  is  large.  And,  after  this  term 
begins  to  dominate,  the  lower  bound  is  an  increasing  function  of  p.  Different  assumptions 
about  a  and  t  will  change  these  curves,  but  the  general  behavior  of  the  curves  will  be  the 
same  if  Na is  large  enough. 

The  exact  behavior  of  the  lower  bound  is  not  useful  information  unless  the  lower  bound 
on  the  parallel  complexity  is  a  fairly  good  estimate  of  the  true  parallel  complexity.  Assume 
that  the  number  of  processors  that  collaborate  in  th*>  approximation  of  u{z.)  is  equal  to 
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Hypercube  ( — ),  1-D  (•  •),  2-D  (•••),  3-D  (- - -) 

Figure  3.2:  Lower  bound  on  parallel  cost  as  a  function  of  the  number  of  collaborating 
processors  when  Na<m  =  1000  and  7  =  2. 


the  maximum  number  of  collaborating  processors,  P0  „.  Let  Pw  be  the  smallest  number 
of  collaborating  processors  for  which  the  communication  term  is  the  dominant  term  in  the 
lower  bound  defined  in  expression  3.10.  Assume  that  the  parallel  complexity  of  algorithm 
a  decreases  monotonically  as  Pa,.  increases,  and  that 


« •  /(+)  •  K. 


a,* 


(3.11) 


is  a  good  estimate  of  the  true  parallel  complexity  when  Pa,.  <  Pw.  Since  the  parallel 
complexity  will  not  increase  as  Pa,.  increases,  and  since  the  lower  bound  on  the  commu¬ 
nication  cost  does,  the  parallel  algorithm  will  not  be  computation  bound  if  Pa>,  >  Pw. 
These  conditions  can  be  made  to  hold  for  the  Green’s  function  method  and  minimum 
cost  parallel  algorithms,  and  this  fact  led  to  the  proof  of  Theorem  3.5.  Most  commonly 
used  algorithms  for  the  approximation  of  linear  PDEs  have  a  great  deal  of  parallelism,  and 
parallel  implementations  will  exist  for  which  expression  3.11  is  a  good  approximation  to 
the  parallel  complexity  for  a  wide  range  of  numbers  of  collaborating  processors. 

Note  that  Pw  is  fairly  large,  and  that  this  aspect  of  the  communication  cost  will  not 
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dominate  until  Pw  processors  are  used.  In  particular, 

PW  =  Q  (aC,^) 

for  a  d  dimensional  array.  Thus,  for  a  1-dimensional  array,  Pw  =  0  ^Af02,.^  when  7  =  1. 

For  a  2-dimensional  array,  .P,,,  =  0  when  7=1.  For  moderate  sized  multi¬ 

processors  and  large  problems,  r(p)  will  not  be  a  limiting  feature.  Other  aspects  of 
communication  cost  may  play  a  role  though. 

3.5  Conclusion 

3.5.1  Summary 

Any  nontrivial  problem  instance  will  have  a  nonzero  lower  bound  on  the  parallel  cost  that 
is  a  function  of  the  basic  processor  speed.  For  example,  if  more  than  one  datum  is  required 
to  calculate  some  desired  solution  value,  then  Lemma  1.4  describes  such  a  nonzero  lower 
bound.  Therefore,  an  infinite  number  of  processors  is  unable  to  solve  a  specific  problem 
instance  arbitrarily  fast  unless  the  processor  speed  also  becomes  infinite.  In  this  chapter 
we  have  shown  that  an  infinite  number  of  processors  of  bounded  speed  are  also  unable  to 
bound  the  calculation  time  as  the  problem  grows  in  size.  This  is  not  dependent  on  the 
multiprocessor  architecture  or  the  parallel  algorithm. 

Define  r(p)  to  be  the  minimum  radius  of  p  processor  subsets  of  a  multiprocessor 
architecture.  Then  r(p)  is  a  lower  bound  on  the  communication  cost  incurred  when  p 
processors  collaborate  in  the  calculation  of  a  solution  value.  To  prevent  the  multiprocessor 
architecture  from  degrading  performance  any  further  than  already  indicated,  the  growth  in 
r(p )  as  a  function  of  p  must  be  bounded.  The  bound  on  the  growth  of  r(p)  is  a  function 
of  the  amount  of  parallelism  in  the  algorithm  used,  but  logarithmic  growth  is  a  sufficient 
bound.  For  minimum  cost  parallel  algorithms,  a  necessary  condition  is  that  the  bound  on 
r(p)  grow  slower  than  any  positive  power  of  p.  This  is  not  achievable  in  multiprocessors 
large  enough  to  be  constrained  by  the  physical  limitations  of  the  three  dimensional  world. 
Therefore,  the  communication  cost  constrains  the  minimum  parallel  cost  of  a  problem 
when  its  size  becomes  large. 
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3.5.2  Generalizations 

The  results  of  this  chapter  are  a  consequence  of  the  following  fact.  The  amount  of  data 
required  to  calculate  a  single  solution  value  increases  unboundedly  as  the  size  of  the 
problem  increases.  The  assumptions  of  Sections  1.3  and  3.2  are  sufficient  to  establish 
this  fact,  but  are  by  no  means  necessary.  We  must  simply  show  that  the  error  in  the 
individual  solution  values  must  go  to  zero  in  order  for  the  error  in  the  approximation  to 
go  to  zero,  and  that  at  least  one  solution  value  cannot  be  calculated  exactly  without  an 
infinite  amount  of  data.  This  is  a  common  situation,  and  these  results  will  hold  for  other 
reasonable  assumptions  as  well. 

If  we  restrict  the  class  of  algorithms  more  tightly,  these  results  can  be  sharpened. 
In  particular,  more  detailed  descriptions  of  the  effects  of  communication  costs  can  be 
derived.  For  example,  see  Gannon  and  Van  Rosendale  [GV84].  We  used  the  Green's 
function  method  to  establish  an  upper  bound  on  the  minimum  parallel  complexity.  This 
analysis  can  also  be  used  to  bound  the  parallel  cost  of  parallel  implementations  of  a 
Green’s  function  method.  For  large  numbers  of  processors,  its  minimum  parallel  cost  is 
also  constrained  by  the  communication  cost,  unless  r(p )  only  grows  logarithmically  in  p. 
The  behavior  of  the  Green's  function  method's  parallel  complexity  is  similar  to  that  of 
more  common  methods,  like  multigrid  (HT82]  and  particle  methods  [GR86],  and  the  same 
type  of  analysis  will  carry  over  immediately. 
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Chapter  4 

Example  Bounds 


In  this  chapter  we  calculate  upper  and  lower  bounds  on  Ntj(z)  for  some  simple  example 
problems.  The  bounds  derived  here  will  be  somewhat  tighter  than  those  described  in 
Chapter  2  since  the  generality  of  Theorems  2.2  and  2.3  is  not  necessary.  We  then  use 
these  results  to  discuss  bounds  on  the  parallel  complexity  and  the  parallel  cost.  We  also 
discuss  an  example  problem  that  does  not  satisfy  some  of  the  assumptions  in  Section  1.3, 
and  describe  how  the  analysis  can  be  generalized  for  this  case. 


4.1  1-D  Elliptic  Example 


Consider  the  problem 

«(O)=02,  U(l)  =  03 

on  the  interval  [0, 1],  for  some  constants  g2  and  g3.  The  integral  representation  of  the 
solution  is 

u(z)  =  f  tfi {z,x)gi{x)dx  +  02-(l-z)  +  g3  •  *  , 

Jo 


where 


(1  -  z)  ■  X  if  X  <  z  , 
(1  -x)-z  if  z  <  i  . 
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There  are  three  components, 

ui(z)  =  [  'J'l (z,x)gx(x)dx  , 

Jo 

M*)  =  <72’  (1  -  Z)  , 

and 

«3 (z)  =  g3-  z  . 

Thus,  dx  =  1  and  d2  =  d3  =  0. 

Let  m  be  a  positive  integer.  Assume  that  gx  is  some  member  of  a  set  Gx  that  is 
characterized  by  the  condition  that 

dm 

^TS(x)  <  m!  Vx  G  [0, 1] 

if  g  €  G\.  For  example,  xm  €  G\.  Assume  that  g2  and  <73  are  only  known  to  be  real 
numbers.  That  is,  G2  =  G3  =  3?  by  the  notation  of  Section  1.3.3. 

4.1.1  Lower  Bound  on  5) 

Assume  that  .5  €  Z.  That  is,  we  are  interested  in  approximating  u(.5).  Since  d2  =  0  and 
g2  is  not  known  to  be  bounded,  the  intrinsic  error  in  approximating  u 2(.5)  is  zero  if  one 
data  sample  is  used  and  infinite  if  none  are  used.  Therefore,  N(i2{.5)  =  1  if  e  is  finite. 
For  the  same  reasons,  Nt,z(.5)  =  1  if  e  is  finite. 

The  graph  of  ^i(.5,x)  as  a  function  of  x  is  displayed  in  Figure  4.1  on  page  89.  The 
function  ^j(.5,x)  is  continuous  and  positive  in  any  interval  of  the  form  [a,  1  —  a]  when 
a  6  (0, .5).  Let 

e(a)  =  C]  •  min  i ^ j f .5.  x)|  •  m!  •  (.5  —  a)m+1 

x€[<*,l-ar! 

=  C.-|-m!.(.5-a)m+1  , 

where  C\  is  the  constant  from  Lemma  2.5.  By  Theorem  2.2,  if  c  <  e(a),  then 
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Figure  4.1:  ^a(.5,x)  for  —  uxx  =  /  given  Dirichlet  boundary  conditions. 


By  the  proof  of  Theorem  2.2,  C{  =  (1/ 5)m+1  •  C\.  By  the  proof  of  Lemma  2.5  in 
Appendix  A, 

Therefore,  inequality  4.1  becomes 


(4.2) 


Inequality  4.2  can  be  improved.  The  construction  of  T  on  page  39  was  unnecessarily 
conservative  for  the  case  d\  =  1.  A  function  7  of  the  type  defined  in  Lemma  2.5  can  be 
placed  between  every  two  sample  locations  in  the  interval  [a,. 5  —  a].  This  defines  a  new 
error  function  T  whose  support  essentially  covers  all  of  [a,  .5  —  a].  For  this  construction 
C\  —  Ci,  and  the  new  lower  bound  is 


This  bound  is  maximized  when  a  =  1/(2 m  +  4).  The  resulting  lower  bound  on  ;Vttl(.5) 
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when  £  <  e(l/(2 m  -f  4)). 

We  can  show  directly  that 


N\, i(.5)  >  [m/2] 


if  £  >  0.  First,  assume  that  there  are  m!  sampling  locations,  {x*},  where  m'  <  fm/2]. 
Let  p2fn/(x)  be  a  polynomial  of  the  form 

m' 

p2m'{x)  =  K  •  IJ(x  -  xk)2  . 

fc= 1 

Then  (dm/dxm)  p2m<(x)  =  0,  and  p2m<  is  zero  at  all  sampling  locations.  If  m  >  0,  then 
gx  +P2m'  €  G i  and  gx  +p2m'  is  indistinguishable  from  pi  at  the  sampling  locations.  Unlike 
the  discussion  in  Section  1.3.6,  the  data  function  gx  -f  p2m'  need  not  be  compatible  with 
any  of  the  other  data  functions.  Therefore,  the  intrinsic  error  associated  with  a  given  set 
of  sampling  locations  is  at  least  as  large  as 

Jjf  'J'1(.5,x)p2m>(x)dx  =  K  j0^  i(-5,x)  ^n(x  ~  xfc)2j  <fx  • 

Since  ^i(.5,x)  is  positive  in  (0,1),  this  lower  bound  on  the  intrinsic  error  can  be  made 
arbitrarily  large  by  making  \K\  arbitrarily  large.  Therefore,  the  intrinsic  error  is  infinite  if 
fewer  than  [m/2]  sampling  locations  are  used.  This  proves  inequality  4.4. 

Inequalities  4.3  and  4.4  together  describe  a  lower  bound  on  N(tX(.5)  for  all  e  >  0. 
We  will  refer  to  this  bound  by  N$(. 5).  Figures  4.2  to  4.5  on  pages  91  and  92  contain 
graphs  of  this  lower  bound  as  a  function  of  t  for  m  =  1,2,3,  and  4,  respectively.  The 
constant  multiplying  the  cm  term  in  inequality  4.3  is  relatively  small,  but  Nc<x(.5)  is  still 
guaranteed  to  be  larger  than  1000  when  c  is  less  than  the  values  in  the  following  table. 


1  £  o  H 

4.63  x  10'6  2.19  x  10"9  9.56  x  10~13  4.16  x  10~16 


upper  bound  ( - )  lower  bound  (•  •  -) 

Figure  4.2:  Bounds  on  Nt ,i(-5)  as  a  function  of  e  for  example  1-D  elliptic  problem 
when  m  =  1. 
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Figure  4.3:  Bounds  on  iVe  l(. 5)  as  a  function  of  e  for  example  1-D  elliptic  problem 
when  m  =  2. 
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Figure  4.4:  Bounds  on  Nc ,i(.5)  as  a  function  of  e  for  example  1-D  elliptic  problem 
when  m  =  3. 
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Figure  4.5:  Bounds  on  iVe  l(. 5)  as  a  function  of  e  for  example  1-D  elliptic  problem 
when  m  =  4. 
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If  the  functions  in  Gx  are  instead  characterized  by  the  condition 


jm 

l 


<  100 -m! 


Vi  €  [0, 1]  , 


then  the  lower  bound  on  5)  is  scaled  by  a  factor  of  1001/rri .  Under  this  condition, 
the  upper  bound  on  e  guaranteeing  that  iVe>1(. 5)  >  1000  is  scaled  by  a  factor  of  100. 


m 

1 

2 

3 

4 

€  < 

4.63  x  10“4 

2.19  x  10“7 

9.56  x  10"1 

4.16  x  10'14 

4.1.2  Upper  Bound  on  iVt]1(.5) 

For  a  given  number  of  subintervals  in  [0,1],  let  gx  be  the  approximation  to  gx  described 
in  Section  2.2.1  on  page  50.  Let  a  be  a  Green's  function  method  that  approximates  u(.5) 
by 

u(.5)  =  /  ^,(.5,i)p,(i)rfi  +  .5-52  +  o  -53  • 

Jo 

Since  u2(. 5)  and  u3(.5)  are  computed  exactly  by  a,  an  upper  bound  on  the  error  in  ap¬ 
proximating  U!(.5)  is  also  an  upper  bound  on  the  error  in  approximating  u(. 5).  Therefore, 
if 

/  |^i(.5,i)|  •  Is^x)  -  gx(x)\dx  <  t  , 

Jo 

then  a  €  /te({.5})  and  Ar(ij(.5)  <  Na}0.  Thus,  we  can  improve  the  upper  bound  in 
Theorem  2.3  by  modifying  the  argument  so  that  e  is  the  bound  on  the  error  in  the 
approximation  to  u i(.5)  instead  of  e/3. 

The  improved  upper  bound  on  NljX{.b)  for  this  problem  is 


N, 


tl(.5)  <  2  •  m  •  (cx  •  J  cc)|  cfx 


_  o  m  ;cx 
-  2  m-  I  — 


{ml)’ 


■(ml) 

+  1  i 


-l  n 

e 


x 


+  1 


where  C\  is  defined  in  Lemma  2.12.  C\  can  be  shown  [Atk78,  page  110]  to  be  the 
maximum  value  of  the  function 

k 


m 

ri(* 

it=i 


m  +  1 


m\ 
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for  x  €  [0, 1].  This  function  is  maximized  when  x  —  0,  and 


A  (m  +  l)m  1 


(m+iy 


Therefore, 


N€, i(.5)  <  2 


/m!\  m  /1\  « 

7  vTj  'w 


+  i  . 


We  will  refer  to  this  upper  bound  by  Figures  4.2  to  4.5  on  pages  91  and  4.5 

also  contain  graphs  of  this  upper  bound  on  5)  as  a  function  of  e  for  m  =  1,2,3, 
and  4  respectively.  Note  that  Nc ,i(.5)  is  no  more  than  1000  when  e  is  greater  than  the 
values  in  the  following  table. 


t  >  1.25x10"“  4.45  x  10-7  2.54  x  10"9  1.97  x  10'11 


As  before,  if  the  functions  in  G\  are  characterized  by  the  condition 

jpz9ix)  <  100  m!  Vz  G  [0, 1]  , 

then  the  upper  bound  is  scaled  by  a  factor  of  1001/m.  Similarly,  the  lower  bound  on  e 
guaranteeing  that  NtA(.b)  <  1000  is  scaled  by  a  factor  of  100. 


4.1.3  Comparison  of  Bounds 

The  bounds  in  Sections  4.1.1  and  4.1.2  are  not  tight.  But  the  ratio  of  the  upper  bound 
N^\.5)  to  the  lower  bound  N^\.5)  is  bounded  from  above  by  the  expression 


fl(m)  =  2  ._!!!_.  ( 2.H±f)-  .(>_+  ‘)-(m+2)_.m! 
m  +  1  V  m  +  1  /  V  2m 


+  1  . 


for  all  e  >  0.  R(m)  is  a  good  approximation  to  the  ratio  for  small  e.  Values  of  R(m)  are 
listed  in  the  following  table  for  m  =  1,2,3,  and  4. 


28.0  15.2  14.8  15.7 
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Thus,  the  upper  and  lower  bounds  differ  by  approximately  one  order  of  magnitude.  Note 
that,  for  large  m,  R(m)  is  approximately  1.5  ■  m.  In  fact,  R(m)  is  an  increasing  function 
of  m  for  m  >  3.  For  example,  if  m  =  10,  then  R(m)  is  approximately  24.5,  while  if 
m  =  100,  then  R{m)  is  approximately  160. 


4.1.4  Bounds  on  Parallel  Complexity 

By  Lemma  2.1, 


/(+) 


log2  £AU-5) 


1 1=0 


is  a  lower  bound  on  the  parallel  complexity  of  any  algorithm  that  satisfies  an  error  tolerance 
of  £  when  approximating  the  solution  value  u(.5).  In  Section  4.1.1  we  showed  that 
,Ve  2  =  jVe,3  =  1  if  £  is  finite,  and  we  calculated  a  lower  bound  on  JV, tl(.5).  Therefore,  by 
the  notation  of  Section  4.1.1,  a  lower  bound  on  the  parallel  complexity  is 


/(«•  flog,  (2  +  A#'(.5))1  . 


(4.5) 


In  Section  4.1.2  we  described  an  algorithm  that  approximated  u(.5)  and  satisfied  an 
error  tolerance  e.  It  used 

2  +  N%\.  5) 

sampling  locations,  where 

The  same  approach  can  be  used  to  approximate  u(z)  for  any  other  z  £  (0. 1),  in  which 
case  the  total  number  of  sampling  locations  is 


2  +  N%\z) 


As  in  the  proof  to  Theorem  2.8,  if  an  algorithm  a  approximates  u(z )  in  this  fashion  for  each 
-  €  Z,  then  the  computation  of  each  value  is  independent,  and  the  parallel  complexity 
can  be  as  low  as 


/(*)  +  /(+)’rTe\x  [l°S2  (2  +  N^(z)) 
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upper  bound  ( - )  lower  bound  ( •  •  • ) 

Figure  4.6:  Bounds  on  minimum  parallel  complexity  as  a  function  of  c  for  example 
1-D  elliptic  problem  when  m  =  2. 

The  calculation  requiring  the  most  sampling  locations  is  the  one  which  maximizes  the 
expression 

Jo  |#i(*,x)|rfx  =  \{z  -  z 2)  . 

Since  this  is  maximized  when  z  =  .5, 

/(.)  +  /(+)•  [log2(2  +  iVt(J)(.5))l  (4.6) 

is  an  upper  bound  on  the  minimum  parallel  complexity  of  optimal  parallel  algorithms, 
independent  of  whether  .5  6  Z. 

Assume  that  /(«)  =  /(+)  and  that  m  =  2.  Then  Figure  4.6  is  a  graph  of  the  upper 
and  lower  bounds  on  the  parallel  complexity  of  optimal  parallel  algorithms  described  in 
inequalities  4.5  and  4.6.  The  parallel  complexity  is  guaranteed  to  be  greater  than  10  •  /(+j 
when  e  <  10-9,  while  the  minimum  parallel  complexity  of  an  optimal  parallel  algorithm 
will  be  less  than  10  •  /<+>  when  e  >  10-6. 
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Hypercube  (—),  1-D  (•  •),  2-D  (•••),  3-D  (- -  -) 


Figure  4.7:  Lower  bound  on  minimum  parallel  cost  as  a  function  of  e  for  example  1-D 
elliptic  problem  when  m  —  2. 

4.1.5  Bounds  on  Parallel  Cost 

The  bounds  on  the  optimal  parallel  complexity  in  the  previous  section  are  very  small  unless 
e  is  very  small,  but  even  the  lower  bound  requires  at  least 

2  +  Nl»(.S) 

2 

processors  in  order  to  achieve  this  complexity.  If  fewer  processors  collaborate  in  the  approx¬ 
imation  of  u(.5),  then  the  bounds  on  the  parallel  complexity  will  be  larger.  Additionally, 
a  parallel  implementation  on  a  real  multiprocessor  will  also  involve  communication  costs. 
For  example,  Figure  3.1  on  page  83  contains  graphs  of  lower  bounds  on  the  parallel  cost 
as  a  function  of  the  number  of  collaborating  processors  when  (2 +N(ti(z))  >  1000.  As 
was  discussed  in  Section  4.1.1,  if  m  =  2  and  z  =  .5,  then  this  is  guaranteed  to  occur 
when  e  <  2.19  x  10-9. 

By  the  discussion  in  Section  4.1.4,  2-f  N^{.5)  is  a  lower  bound  on  the  number  of  sam¬ 
pling  locations  required  to  approximate  u(. 5)  when  e  is  finite.  Therefore,  by  corollary  1.2 
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on  page  16, 


min  max  \  r(p) ,  /(+)  • 

p€{  1 . <»} 


2  +  Afff(.5) 
P 


,  /(+)  •  [log2  (2  +  Nil ?(.5))1 


is  a  lower  bound  on  the  parallel  cost.  Assume  that  m  —  2  and  that  t  =  /(+)  for  all 
multiprocessor  architectures  under  consideration.  Then  Figure  4.7  on  page  97  is  a  graph 
of  this  lower  bound  for  a  variety  of  architectures.  It  is  clear  from  the  Figure  that  the 
lower  bound  on  the  parallel  cost  is  significantly  larger  than  the  lower  bound  on  the  parallel 
complexity  when  the  multiprocessor  is  a  d  dimensional  grid  and  e  is  small. 


4.2  1-D  Hyperbolic  Example 


Consider  the  problem 


d 2 
d? 

u(x,t )  -  -d^u(x,t)  =  0 

for 

(x,  t)  G  [0, 1]  x  [0,  .1] 

^u(x,0)  =  <7i(x) 

for 

x  G  [0, 1] 

u(x,0)  =  0 

for 

x  G  [0,1] 

O 

C-i- 

II 

O 

II 

for 

t  G  [0,  .1]  ■ 

The  integral  representation  of  the  solution  is 

u(z.t)  =  f  (z,t,x)gx{x)dx  , 
Jo 


where 

1 0  otherwise 

when  z  G  (f,l  —  <)•  If  z  (<,  1  —  0-  then  the  definition  of  ^(z,  t,  x)  is  similar,  but  the 
support  is  contained  in  an  interval  of  length  less  than  2 1.  Thus,  there  is  one  component, 
and  dt  =  1 . 

Let  m  be  a  positive  integer.  Assume  that  gx  is  some  member  of  a  set  G\  that  is 
characterized  by  the  condition  that 


dxm 


g(x) 


<  m\ 


Vx  G  [0.1] 


if  g  G  G\ . 
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4.2.1  Lower  Bound  on  i (.5,  .1) 

Assume  that  (.5,  .1)  €  Z.  The  graph  of  ^i(.5,  .l,x)  as  a  function  of  x  is  displayed 
in  Figure  4.8  on  page  100.  The  function  ^1(.5,.l,x)  is  continuous  and  positive  for 

x  €  (.4,  .6].  Let 

/  6  -  ,4\m+1 

c.  =  Ci-  min  |®i(.5,.l,x)|  •  m!  •  ( — - — 

r€[.4,.6]  \  Z  / 

m  1/1  \m+1 

m  +  1  2  V 10  7 

By  the  same  modification  of  Theorem  2.2  used  in  Section  4.1.1,  if  e  <  e„.  then 


since  C[  =  C\.  As  in  Section  4.1.1,  we  also  have  the  lower  bound 

iV(il(.5,.l)>  [m/2]  (4-8) 

if  e  >  0. 

Inequalities  4.7  and  4.8  together  describe  a  lower  bound  on  N(i i(.5,  .1)  for  all  e  >  0. 
We  will  refer  to  this  bound  by  .1).  Figure  4.9  on  page  101  contains  the  graph  of 

this  lower  bound  as  a  function  of  e  for  m  —  2. 

-Ve,i(.5,  .1)  is  guaranteed  to  be  larger  than  1000  when  e  is  less  than  the  values  in  the 
following  table. 


And,  as  before,  if  the  functions  in  G\  are  instead  characterized  by  the  condition 

dm 

T~^ff(x)  S:  100  ■  m!  Vx  €  [0. 1]  , 
ax 

then  the  lower  bound  on  /Ve,i(.5,.l)  is  scaled  by  a  factor  of  100I/,m.  Similarly,  the  upper 
bound  on  e  guaranteeing  that  N(< i(.5,.l)  >  1000  is  scaled  by  a  factor  of  100. 
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x 


Figure  4.8:  ^(.S,  .l,x)  for  one  dimensional  hyperbolic  PDE  example. 


4.2.2  Upper  Bound  on  Ar€ii(.5,  .1) 

Since  we  know  that  ^i(  5,  .1,  x)  is  zero  when  x  $  (.4,  .6],  we  can  improve  on  the  algorithm 
described  in  Section  2.2.1.  For  a  given  number  of  subintervals  in  [.4,  .6],  let  gx  be  the 
approximation  to  gl  described  in  Section  2.2.1.  Let  a  be  a  Green’s  function  method  which 
approximates  u(.5)  by 

fi(-5)  =  I  . 

Then  the  upper  bound  in  Theorem  2.3  can  be  strengthened  since  only  one  fifth  as  many 
subintervals  of  length  2~‘/° 1  are  required  to  cover  [.4,. 6]  as  are  required  to  cover  [0,1]. 
The  improved  upper  bound  on  lV{il(. 5)  for  this  problem  is 
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102Or  : 
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io12 :  : 

N  :  -  : 
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io4 :  '  '  ^  : 
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IO'20  IO"16  10~12  IO"8  IO"4  10u 

e 

upper  bound  ( - )  lower  bound  (•  •  •) 

Figure  4.9:  Bounds  on  7Vfl(.5,.l)  as  a  function  of  e  for  example  1-D  hyperbolic 
problem  when  m  =  2. 


We  will  refer  to  this  upper  bound  by  N[UX\. 5,.l).  Figure  4.9  also  contains  the  graph  of 
this  upper  bound  on  Ne^(. 5,  .1)  as  a  function  of  e  for  m  =  2. 

JVt'i(.5,.l)  is  no  more  than  1000  when  e  is  greater  than  the  values  in  the  following 
table. 


m 

1 

2 

3 

4 

t  > 

2.00  x  10"5 

1.43  x  10‘8 

3.25  x  IO"11 

6.06  x  10~13 

If  the  functions  in  G\  are  characterized  by  the  condition 


dx' 


:9{x) 


<  100  -ml 


Vi  6  [0,1]  , 


then  the  upper  bound  is  scaled  by  a  factor  of  100Vm,  and  the  lower  bound  on  e  guaran¬ 
teeing  that  N(% i(.5,.l)  <  1000  is  scaled  by  a  factor  of  100. 
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4.2.3  Comparison  of  Bounds 

The  ratio  of  the  upper  bound  N^\. 5,  .1)  to  the  lower  bound  .1)  is  bounded 

from  above  by  the  expression 

R(m)  =  4  •  •  (2  •  ^1)  m  •  (m!)">  +  1 

for  all  t  >  0.  Values  of  R(m)  are  listed  in  the  following  table  for  m  =  1,2,3,  and  4. 


m 

12  3  4 

R(m) 

9.00  7.53  8.56  9.91 

The  upper  and  lower  bounds  are  a  little  tighter  for  this  example  than  for  the  1-D  elliptic 
problem  example,  but  they  still  difFer  by  approximately  one  order  of  magnitude  when 
m  <  4.  Asymptotically,  R(m)  is  again  approximately  1.5  •  m,  and  R[m)  is  an  increasing 
function  of  m  for  m  >  2.  If  m  =  10,  then  R(m)  is  approximately  19,  while  if  m  =  100, 
then  R(m)  is  approximately  153. 

4.2.4  Bounds  on  Parallel  Complexity  and  Cost 

A  lower  bound  on  the  minimum  parallel  complexity  of  optimal  parallel  algorithms  is 

/(+)•  [log!(/V<J>(.5,.l))]  . 

It  is  straightforward  to  show  that  N[UX\. 5,  .1}  is  an  upper  bound  on  NtiX(z,t)  for  all 
(z,t)  €  (0,1)  x  [0,  .1] .  Therefore,  an  upper  bound  on  the  minimum  parallel  complexity 
of  optimal  parallel  algorithms  is 

/(.)  +  /<+)•  [logJ(JV,<?>(.5,.l))]  . 

Assume  that  /(„)  =  /(+)  and  that  m  =  2.  Then  the  parallel  complexity  is  guaranteed 
to  be  greater  than  10  •  /(+>  when  e  <  IO-10,  while  the  minimum  parallel  complexity  of 
an  optimal  parallel  algorithm  will  be  less  than  10  •  /(+)  when  e  >  10-7.  If  fewer  than 
Ar,,i(.5,.l)/2  processors  are  available,  or  if  the  multiprocessor  is  based  on  a  d  dimensional 
grid,  then  the  actual  parallel  cost  will  be  significantly  larger  than  these  bounds  on  the 
minimum  parallel  complexity. 
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4.2.5  Generalization  of  1-D  Hyperbolic  Example 


Consider  the  problem 

■jjpu(x,t)  -  -£pu(x,t)  =  0  for 
=  gx(x)  for 

u(x,  0)  =  h(x)  for 

u(0,  t)  =  0  =  u(l,  t)  for 


(x,t)  6  [0,1]  x  [0,  .1] 
*€[0,1] 
x  e  [o,  l] 

t  €  [0,  .1]  . 


If  z  €  (t,  1  —  t),  then  the  integral  representation  of  the  solution  is 

u(M)  =  jf  '$l(z,t,x)g1(x)dx  +  ^h(z-t)  +  -h(z  + 1)  , 

where  is  the  same  function  described  in  Section  4.2.1.  Thus,  this  solution 

operator  has  three  components: 

ui(z,t)  =  /  'P1(2,i,x)^i(x)dx  , 

./o 

U2 {z,t)  =  2 Hz~t)  , 

U3 {z,t)  =  ~ h{z  +  t )  . 

This  solution  operator  does  not  satisfy  the  assumptions  of  Section  1.3.  Components  2 
and  3  are  restrictions  of  a  common  function  to  the  zero  dimensional  manifolds  x  =  z  —  t 
and  i  =  2  +  (,  If  h  is  smooth,  then  approximate  values  of  h  can  be  calculated  without 
knowing  h  at  these  two  locations.  For  example,  if  e  is  large,  but  finite,  then  h(z)  may  be 
an  adequate  approximation  to  both  h(z  —  t)  and  h(z  +  t). 

It  is  still  straightforward  to  bound  the  parallel  complexity  for  this  problem.  Assume  that 
h  is  not  a  linear  function.  Assume  that  z£  {t,l—t).  Define  iVe23(2,  t)  to  be  the  minimum 
number  of  sampling  locations  of  h  in  [0, 1]  required  to  approximate  u2(z,  t)  +  u3(z,t)  to 
within  an  error  tolerance  of  e.  The  error  is  uncontrolled  unless  as  least  one  sampling 
location  is  used.  Therefore,  Nu2,3{z,t)  >  1  for  all  finite  e  >  0.  Since  knowing  the  values 
h(z  —  t)  and  h{z  +  t)  is  sufficient, 


Nt,2,3(z,  t)  <  2 
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These  inequalities  define  upper  and  lower  bounds  on  jV<i2i3(z,  t).  Thus,  if  ( z,t )  €  Z  and 
z  6  (f,  1  —  t),  then  upper  and  lower  bounds  on  the  parallel  complexity  are 

/(+)■  flogl  (i  +  .I))] 

and 

/<■)+/<+)•  flogj(2  +  JV<y»(.5,.l))]  , 

respectively,  where  N^\. 5,.l)  and  are  defined  in  Sections  4.2.1  and  4.2.2. 

Given  assumptions  on  h,  tighter  bounds  can  be  calculated. 


4.3  2-D  Elliptic  Example 


Consider  the  problem 

\ 

u(xi,x2)  =  g\{xi,x7)  for  (x,,x2)€B((.5,.5);.5) 
u(x!,x2)  =  0  for  (x!,x2)  e  d£((.5,  .5);  .5)  . 


a2  *2 


dx%  dx2l 


The  integral  representation  of  the  solution  is 

u(zi,z2)=  yl(zl,z2,xi,x2)gi(xl,x2)dx1dx2  , 

JB({.  5,.5);.S) 

where  ^1(21,  z2,  Xi,  x2)  is  the  function 


L  .  log  f  (-25  -  (Zl  -  -5)  (xi  -  .5)  +  (z2  -  .5)  (x2  -  .5))2 
t7r  \  -25  •  ((2!  -  Xj)2  +  (22  -  X2))2 


((21  -  .5)  (x2  -  .5)  -  (z2  -  .5)  (xi  -  -5))2 
•25  •  ((2,  -  xj)2  +  ( 22  -  x2)) 

This  can  be  put  into  the  standard  form  by  defining  to  be  zero  when  (zj,  z2)  or  (xu  x2) 
are  outside  B((. 5,  .5);  .5).  The  data  function  gx  must  also  be  extended  to  the  larger 
domain  of  /2l  but  its  values  outside  of  J3((.5,  .5);  .5)  do  not  influence  the  solution  function. 
Thus,  there  is  one  component,  and  dj  =  2. 
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Let  m  be  a  positive  integer.  Assume  that  g\  is  some  member  of  a  set  G\  that  is 
characterized  by  the  condition  that 

max 

£€Aj(m) 


a>  l-\ 

W>9{x) 


m! 


V(*i,x2)  €  [0,1]  x  [0,1] 


if  g  G  Gj.  For  example,  (x™  +  x£*)  e  G 


4.3.1  Lower  Bound  on  .5) 


Assume  that  (.5,  .5)  €  Z.  If  (xi,x2)  €  B((. 5,  .5);  .5),  then 


'Pi(.5,.5,x1,x2)  =  —  log 
4  7T 


4  •  (x,  -  .5)'  +  (x2  -  .5)2 


=  ^i(.5,  .5,  r)  , 

where  r  =  yj(xi  —  .5)2  +  (x2  —  .5)2.  The  graph  of  ^i(.5,.5,r)  as  a  function  of  r  is 
displayed  in  Figure  4.10  on  page  106.  The  function  ^i(.5, .5, xj, x2)  is  continuous  and 
positive  in  any  subset  of  /2  that  excludes  the  location  (.5,  .5).  But  the  proof  of  The¬ 
orem  2.2  is  unchanged  even  if  the  location  (.5,. 5)  is  included.  Therefore,  consider  the 
closed  ball  B((.5,  .5);  a)  for  a  €  (0,  .5),  and  define 


e(a)  =  Ci-  min  I ^ i (.5,  .5,  r)|  •  m!  •  am+2 

re(0,Q) 

By  our  generalized  proof  of  Theorem  2.2,  if  e  <  e(a),  then 


AU-5,.5)  > 


min  |^i(.5,.5,r)| 
’■e[o,o] 


m!  • 


x 

m 


(4-9) 


By  the  proof  of  Theorem  2.2,  C[  =  (l/5)m+2  •  C\.  And,  by  the  proof  of  Lemma  2.5 
in  Appendix  A, 


Ci 


7T 

m  ■  ml  ■  m\ 

7T 


(- - — 

V  2  m  +  2 

m  +  1 


m  ■  m!  •  m!  m  +  2 
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Therefore,  inequality  4.9  becomes 

AT,.,(.  5,.5)>  (j) 


m  4-  1 


a4  0 


2  •  m  ■  ml  •  (m  +  2) 

This  bound  is  maximized  when  a  =  .5-e~l^m+2K  The  resulting  lower  bound  on  iVe,i(.5,  .5) 

2i=S±21  /  \  .  il 


IS 


-V«.i(-5,.5)  > 


(n 


m  +  1 


(i): 


(4.10) 


2  •  e  •  m  ■  m!  •  (m  +  2)2 
when  e  <  c(.5  •  e-l/(m+2)).  The  same  type  of  argument  used  in  Section  4.1.1  proves  that 


.5)  >  [m/2]  (4.11) 

if  c  >  0. 

Inequalities  4.10  and  4.11  together  describe  a  lower  bound  on  iVtil(.5,  .5)  for  all  e  >  0. 
As  before,  we  will  refer  to  this  bound  by  N^(.b,.o).  Figure  4.11  on  page  107  contains 
the  graph  of  this  lower  bound  as  a  function  of  e  for  m  =  2. 

jVe  i(.5,  .5)  is  guaranteed  to  be  larger  than  1000  when  t  is  less  than  the  values  in  the 
following  table. 
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upper  bound  ( - )  lower  bound  (•  •  •) 


Figure  4.11:  Bounds  on  Ne< i(.5,  .5)  as  a  function  of  t  for  example  2-D  elliptic  problem 
when  m  =  2. 


m 

1 

2 

3 

4 

e  < 

1.29  x  IO"6 

8.62  x  IO"10 

5.17  x  10"13 

2.66  x  IO'16 

If  the  functions  in  G\  are  instead  characterized  by  the  condition 

<  100  •  m!  V(xi,x2)  e  [0, 1]  x  [0, 1]  , 


max 


d * 


then  the  lower  bound  on  iV{il(.5,.5)  is  scaled  by  a  factor  of  1002/m.  Similarly,  the  upper 
bound  on  e  guaranteeing  that  iVf  l(.5,.5)  >  1000  is  scaled  by  a  factor  of  100, 


m 

1  2 

3 

4 

e  < 

1.29  xlO"4  8.62  x  10"8 

5.17  x  10"11 

2.66  x  IO"14 

4.3.2  Upper  Bound  on  N(A(. 5,  .5) 

For  a  given  number  of  subcubes  in  [0, 1]  x  [0, 1],  let  gi  be  the  approximation  to  gx  described 
in  Section  2.2.1.  Let  a  be  a  Green’s  function  method  which  approximates  u(. 5,  .5)  by 

u(.5,.5)  =  /  ^i(.5,.5,x1,x2)5i(xi,x2)c/xi  dx2  . 
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The  upper  bound  in  Theorem  2.3  is 

N', i(-5,  -5)  <  4  •  m2  •  |^i(.5,  .5,xi,x2)|  dx\  dx^j  ™  •  (m!)™  •  "*  +  1 


-  4  s)*  • (m!)4  •  ®  *  +  1  • 


By  the  proof  of  Lemma  2.12  in  Appendix  B  and  the  discussion  in  Section  4.1.1,  we  know 
that 

1  +2m 


Ci  = 


Therefore 


W,.i(.5,.5)<4.(^): 


(m  +  l)r 


2m  +  1 
32 


(>*!)"  '  (“) 


+  1  . 


We  will  refer  to  this  upper  bound  by  N^\. 5,  .5).  Figure  4.9  also  contains  the  graph  of 
this  upper  bound  on  W£>1(.5,  .5)  as  a  function  of  c  for  m  =  2. 

iV{il(.5,  .5)  is  no  more  than  1000  when  e  is  greater  than  the  values  in  the  following 
table. 


m 

1 

2 

3 

4 

e  > 

2.97  x  10"3 

5.56  x  10~4 

1.80  x  10~4 

8.37  x  10"5 

If  the  functions  in  Gx  are  characterized  by  the  condition 

<  100  ml  V(ii,i2)  €  [0, 1]  x  [0, 1]  , 


max 


aPs(i) 


then  the  upper  bound  is  scaled  by  a  factor  of  1002//m,  and  the  lower  bound  on  e  guaran¬ 
teeing  that  JVe, i(.5)  <  1000  is  scaled  by  a  factor  of  100. 

4.3.3  Comparison  of  Bounds 

The  ratio  of  the  upper  bound  N^\. 5,  .5)  to  the  lower  bound  N^\. 5,  .5)  is  bounded 
from  above  by  the  expression 

'25  •  e  2m  -I-  1 


R(m)  =  400 


m  +  1 


m  ■  (m!)2  •  (m  +  2) 


m 


m  +  1 


+  1 
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for  all  e  >  0.  Values  of  R(m )  are  listed  in  the  following  table  for  m  =  1,2,3,  and  4. 


m 

1 

2 

3 

4 

R(m) 

5.26  x  107 

6.44  x  106 

4.95  x  106 

5.60  x  106 

The  upper  and  lower  bounds  are  much  worse  for  this  example  than  for  the  one  dimensional 
examples.  They  now  differ  by  approximately  seven  orders  of  magnitude  when  m  <  4. 
Techniques  for  improving  the  bounds  are  described  in  Section  4.3.5.  Asymptotically, 
R(m)  is  approximately  29.3  ■  m4,  and  R(m)  is  an  increasing  function  of  m  for  m  >  3. 

4.3.4  Bounds  on  Parallel  Complexity  and  Cost 

As  in  Section  4.2.4,  a  lower  bound  on  the  minimum  parallel  complexity  of  optimal  parallel 
algorithms  is 

/(+)•  [l°8IKf’(-5,.5))]  . 

It  is  again  straightforward  to  show  that  .5)  is  an  upper  bound  on  iVttl(21,22) 

for  all  (zi,22)  G  f?((.5,  .5);  .5).  Therefore,  an  upper  bound  on  the  minimum  parallel 
complexity  of  optimal  parallel  algorithms  is 

/(■)  +  /,+)•  fl°62(A'S’(.5,.5)))  ■ 

Assume  that  /(.)  =  /(+),  and  that  m  =  2.  Then  the  parallel  complexity  is  guaranteed 
to  be  greater  than  10  •  /(+)  when  e  <  10~10,  while  the  minimum  parallel  complexity  of 
an  optimal  parallel  algorithm  will  be  less  than  10  •  /(+>  when  t  >  10-3.  As  before,  if 
fewer  than  5,  .5)/2  processors  are  available,  or  if  the  multiprocessor  is  based  on  a 
d  dimensional  grid,  then  the  actual  parallel  cost  will  be  significantly  larger  than  these 
bounds  on  the  minimum  parallel  complexity. 

4.3.5  Improving  the  Bounds 

The  bounds  calculated  for  this  problem  describe  nontrivial  constraints  on  the  parallel  cost 
of  computing  the  solution,  but  the  discrepancy  in  the  upper  and  lower  bounds  is  very 
large.  There  are  several  ways  to  improve  these  bounds.  For  concreteness,  we  will  limit 
the  discussion  of  the  amount  of  improvement  to  the  case  m  =  2. 
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a)  ^(z^,  z2,xi,x2)  is  identically  zero  when  (xi,x2)  is  outside  the  ball  B((. 5,  .5);  .5), 
and  sampling  locations  outside  the  ball  are  not  really  needed.  Thus,  when  construct¬ 
ing  the  upper  bound,  g\  only  needs  to  be  approximated  in  subcubes  that  intersect 
the  ball.  This  decreases  the  upper  bound  by  a  factor  of  approximately  ir/4  when 
the  subcubes  are  small.  This  is  the  same  approach  used  to  improve  the  estimates 
for  the  example  1-D  hyperbolic  problem. 


b)  The  generalizations  described  in  Sections  2.1.4  and  2.2.2  base  the  sampling  fre¬ 
quency  in  any  subregion  R  on  the  value  of  in  that  subregion.  If  the  volume  of 
R,  Vol (R),  is  small  and  .5, xj,x2)  is  continuous  in  R,  then 

/  .5,  ii,  x2)|  dxx  dx2  ~  min  .5,  ij,  x2) 

JR  jefl 

Since  is  a  type  1  function,  the  upper  and  lower  bounds  on  the  sample  size  are 

only  functions  of  over  regions  where  it  is  continuous.  Thus,  the  upper  and  lower 

bounds  will  be  similar  functions  of 

We  can  estimate  the  amount  of  improvement  that  is  possible  from  using  these 
generalizations.  Assume  that  ^,(.5,  .5,  xi,x2)  =  1  when  (xi,x2)  €  B((.5,  .5);  .5), 
and  is  identically  zero  outside  the  ball.  Now  the  upper  and  lower  bounds  depend  on 
in  exactly  the  same  way.  If  m  =  2,  then  the  ratio  of  the  bounds  for  this  problem 
is  approximately  1/27  times  the  ratio  of  the  bounds  for  the  original  problem. 


c)  The  calculation  of  the  lower  bound  on  the  number  of  sampling  locations  was  very 
rough.  In  particular,  in  Lemma  2.11  the  optimal  placement  of  the  sampling  locations 
for  this  construction  was  simply  shown  to  satisfy 


6 

niM 


in  a  d  dimensional  ball  B(x,\6).  Here  rk  is  the  maximum  distance  between  the 
A:th  sampling  location  and  the  boundary  of  its  Voronoi  cell,  and  n  is  the  number  of 
sampling  locations  in  the  ball.  While  the  analysis  described  here  was  not  sufficient  to 
prove  it,  we  expect  that  the  optimal  sampling  locations  in  B(x.\ 6)  will  be  essentially 
equidistributed.  In  this  case 
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for  all  k.  For  this  type  of  distribution,  the  error  function  T  can  be  constructed  so 
that  almost  every  sampling  location  w  "contributes"  a  function  7^  to  I\  Lemma  2.9 
can  then  be  modified  to  state  that  the  support  of  this  new  T  essentially  covers  the 
entire  ball,  and  the  resulting  lower  bound  is  increased  by  a  factor  of  5m+d.  Thus, 
when  m  =  2,  the  lower  bound  for  the  example  2-D  elliptic  problem  is  increased  by 
a  factor  of  625. 


Improvements  a-c  can  theoretically  decrease  the  ratio  by  a  factor  of 


7T  1  1 

4  '  27  '  625 


4.66  x  10~5 


when  m  =  2.  The  ratio  of  the  improved  upper  bound  to  the  improved  lower  bound  is 
approximately  300.  Note  that  the  majority  of  the  change  reflects  an  increase  in  the  lower 
bound.  Thus,  for  this  problem,  the  upper  bound  is  a  much  better  estimate  of  the  true 
value  of  Are ,  1  ( .5 ,  .5). 

To  improve  the  bounds  further  requires  improving  the  constants  C\  and  C\.  For 
example,  C\  is  primarily  a  function  of  the  type  of  error  function  used  in  the  construction 
of  the  lower  bound.  The  error  function  described  in  Chapter  2  is  very  conservative  for 
dimensions  greater  than  or  equal  to  2.  It  is  a  C°°  function,  and  is  zero  at  many  locations 
other  than  those  required  by  the  analysis.  Thus,  we  expect  that  this  constant  can  be 
increased. 


4.4  Summary 

The  results  of  Chapters  2  and  3  were  sufficient  to  establish  the  behavior  of  NCil{z)  as 
a  function  of  e,  and  to  bound  the  parallel  complexity  and  cost  of  minimum  cost  parallel 
algorithms.  For  the  one  dimensional  example  problems  described  in  this  chapter,  we  were 
able  to  substantially  tighten  these  bounds  by  taking  advantage  of  the  specifics  of  the 
problems.  But,  even  without  using  the  improvements  described  in  Section  4.3.5,  we  were 
able  to  calculate  nontrivial  bounds  on  the  intrinsic  cost  of  the  two  dimensional  example 
problem.  These  bounds  become  even  more  striking  as  the  assumptions  on  the  function 
sets  {G\}  become  weaker. 
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The  problem  assumptions  made  in  Section  1.3  are  not  satisfied  by  all  classes  of  linear 
PDEs.  Section  4.2.5  described  a  problem  where  certain  components  can  be  approximated 
without  using  sampling  locations  in  the  domain  of  the  component.  This  is  typical  of 
hyperbolic  PDEs.  But,  as  demonstrated,  the  analysis  could  be  generalized  bv  treating 
these  components  separately.  While  this  particular  example  was  trivial,  a  similar  approach 
will  work  for  more  general  cases. 


Chapter  5 
Conclusions 


The  cost  of  calculating  a  numerical  approximation  to  the  solution  of  a  partial  differential 
equation  is  constrained  by  the  amount  of  data  it  must  use  in  order  to  calculate  an  accurate 
enough  approximation.  For  the  simple  assumptions  about  the  data  function  and  the 
partial  differential  equation  described  in  Section  1.3,  these  constraints  limit  how  quickly 
the  approximation  can  be  calculated.  As  e,  the  bound  on  the  error  in  the  approximation, 
decreases,  there  exists  a  lower  bound  on  the  parallel  complexity  that  must  increase  at 
least  linearly  as  a  function  of 

Mi)  • 

Moreover,  this  bound  is  tight  in  the  sense  that  there  exists  a  family  of  explicit  linear 
algorithms  with  parallel  implementations  whose  parallel  complexity  grows  no  faster  than 
this. 

The  number  of  processors  used  to  calculate  the  approximation  must  increase  as  the 
error  bound  decreases  if  the  lower  bound  on  the  parallel  complexity  is  to  be  achieved.  This 
usually  entails  an  increase  in  the  communication  cost.  But  the  communication  cost  does 
not  change  the  asymptotic  behavior  as  long  as  the  parallel  algorithm  remains  computation 
bound.  This  is  not  always  possible  for  small  error  tolerances  when  the  approximation  is  cal¬ 
culated  on  a  multiprocessor  whose  radius  grows  faster  than  logarithmically.  In  particular, 
the  parallel  cost  of  optimal  algorithms  on  d.  dimensional  grid  architectures  is  determined 
by  the  communication  capabilities  of  the  architecture  rather  than  by  the  minimum  parallel 
complexity.  But,  for  a  given  processor  speed,  there  exists  a  logarithmic  bound  on  the 
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radius  of  a  multiprocessor  that  ensures  that  the  communication  cost  associated  with  the 
radius  will  not  influence  the  behavior  of  the  parallel  cost  of  optimal  parallel  algorithms. 

As  mentioned  in  Section  3.5.2,  the  assumptions  used  to  prove  these  results  are  merely 
sufficient,  and  not  necessary.  We  contend  that  they  are  reasonable  assumptions  for  a  priori 
estimates  on  the  error  for  mth  order  accurate  approximations,  m  <  oo,  to  the  solution  of 
many  important  classes  of  linear  scalar  partial  differential  equation.  And  the  results  can 
be  extended  to  any  problem  with  a  solution  operator  that  can  be  represented  as  a  sum 
of  integral  components  if  at  least  some  of  the  components  satisfy  the  assumptions.  A 
simple  example  of  this  was  described  in  Section  4.2.5 

The  assumptions  and  analysis  described  in  the  previous  chapters  are  also  reasonable 
for  other  classes  of  problems. 

•  Very  little  changes  for  systems  of  linear  partial  differential  equations.  Many  linear 
problems  also  have  integral  representations  for  the  solution  operator  for  each  element 
of  the  solution  vector,  where  the  kernels  satisfy  the  usual  assumptions. 

•  For  nonlinear  problems  the  solution  operator  will  not  have  such  a  simple  representa¬ 
tion.  But  most  numerical  approximations  to  nonlinear  problems  can  be  interpreted 
as  solutions  to  a  sequence  of  linear  problems.  Thus,  the  results  carry  over  immedi¬ 
ately  if  these  linear  subproblems  satisfy  the  assumptions.  The  only  difficulty  is  that 
the  solution  u  is  not  necessarily  smooth  when  the  data  is  smooth.  Thus,  the  num¬ 
ber  of  solution  values  required  to  represent  the  solution  function  can  be  arbitrarily 
large.  For  most  practical  applications  the  solution  is  smooth  almost  everywhere,  and 
discontinuities  are  limited  to  a  finite  number  of  bounded  smooth  lower  dimensional 
surfaces.  The  additional  cost  is  then  in  capturing  or  tracking  these  surfaces.  This 
is  also  approximated  by  the  calculation  of  a  finite  number  of  values  dependent  on 
the  error  bound. 

•  The  data  functions  are  discontinuous  for  many  applications.  But,  similar  to  the 
discussion  of  nonlinear  problems,  the  data  is  generally  smooth  on  all  but  a  finite 
number  of  bounded  smooth  lower  dimensional  surfaces.  The  analysis  follows  through 
as  before  as  long  as  the  solution  is  also  smooth  except  on  a  finite  number  of  bounded 
smooth  lower  dimensional  surfaces  for  this  type  of  data. 
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•  The  analysis  only  depends  on  the  solution  operator  having  the  required  integral  for¬ 
mulation.  Thus,  the  results  are  not  restricted  to  linear  partial  differential  equations. 


Appendix  A 

Construction  of  the  Error 

Function  in  1^. 

2 


In  this  appendix  we  prove  Lemma  2.5: 

Assume  that  d{  >  0.  For  any  x.  €  Id,  and  6  >  0  such  that  B(x.;6 )  C  Id,, 
there  exists  a  function  7  €  C°°[Id,)  with  the  following  properties. 

1)  7  €  G„ 

2)  7(1)  =  0  for  all  x  $  B(x.;6). 

3)  7 (x)  >  0  for  all  x  6  B(x,-,8). 

4)  There  exists  a  constant  C,  >  0  depending  only  on  mi,  d,,  and  ||  •  ||(,) 
such  that 

f  7 (x)dx  >  Ci  ■  6m,+di  •  min  M,(x)  , 

Jii,  ~  ieS(  *.;«) 

where  B(x.\6)  is  the  closure  of  B(x.;6). 

We  will  prove  the  Lemma  by  constructing  a  spherically  symmetric  function  in  5(x.;6) 
whose  integral  is  the  limit  of  integrals  of  functions  that  satisfy  conditions  1-3.  We  begin 
by  proving  a  condition  on  the  radial  derivative  that  insures  membership  in  G,. 
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A.l  Sufficient  Bounds  on  Directional  Derivatives 

Functions  in  G,  are  characterized  by  the  bound 

l|Vl”">jr(x)||(„  <  M,(x) 

and  membership  in  Cm'(Idt).  A  sufficient  condition  for  g  to  be  in  G,  is  described  by  the 
following  lemma. 

Lemma  A.l  Let  g  be  a  C°°(Id,)  function  with  the  following  property.  There  exists 
a  constant  6  >  0  ,  a  location  x.  6  Id,,  and  a  function  g(r)  such  that 


1)  g(x)  =  g(||x  —  1.1)2)  in  the  d{  dimensional  open  ball  B(x.;S). 

2)  g(x )  =0  ifx  &  B{x.;6). 


Assume  further  that  g  satisfies  the  condition  that 

J>>  - 0 

for  all  s  6  {1, . . . ,  m;  —  1} .  Then  there  exists  a  constant  Ci  >  0  depending  only  on 
II  '  ||  (i)t  mi>  and  dt  such  that  g  G  G,  when 


max 

r€[0,«] 


dm' 

IP*' 


9(r) 


<  C,  •  min  Af,(x)  . 

r€S(r.;$) 


Proof:  By  the  equivalence  of  norms  over  finite  dimensional  spaces  [Atk78],  we  know  that 
there  exist  positive  constants  ci  and  c2  such  that 

c,  •  l|V£"',s(i)ll„  <  ||vi7'lj(x)||(,|  <  c,  •  ||V5”'»J(X)IU  .  (A.l) 

where  ||(xj, . . . ,  x,)||oo  =  max;6{i . |x*|.  Therefore,  if 

nvsr’jwiu  <  , 

then 

nvir'9(x)n(1)  <  Mi(±] . 
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Since  g  G  C°°{Id,),  g  is  also  a  member  of  Cm,(Idi)  for  any  m,.  Therefore,  to  prove  the 
result,  it  is  sufficient  to  find  a  constant  C{  such  that 

<  —  •  Mt(x)  V/i  6  I<d,(Tni)  .  €  Id, 

C2 

when 

<  C,  •  min  A/,(x)  Vr  6  [0, 6]  . 

f  $B(£,;S) 

When  di  =  1,  it  is  clear  that  choosing  C{  =  l/c2  verifies  the  Lemma.  For  the  rest  of  this 
proof,  we  will  assume  that  d,  >  1. 

If  x  B(x.\6),  then  d*g{x)/ dx*  =  0  for  all  p  E  Kdt(m.i),  and  any  C,  >  0  will 
suffice.  Assume  that  x  £  B(x.;6).  Let  p  be  some  element  of  By  a  repeated 

application  of  the  chain  rule, 

jp9{\\x  ~  x.||a)  =  £  Q**(x)  “  z.lh)  (A.2) 

where  |qSiJ(x)|  <  m,!  •  (||x  —  x.|]2)*-m’ .  Replace  each  expression  of  the  form 


dm' 


W‘9{x) 


(d'/r')g(\ jx  -  x.||2) 


by  its  Taylor  expansion  around  r  =  0, 


dr>9i\\x  *-lla)  E  (*_s)t  drt9^  + 


|x  -  X.  2 


'm'-5  dm 


(m,-s)!  drm' 


■9(t.) 


F  -  x.  2 


im, -j  », 


(m,  —  s)!  drm' 
where  6  [0,  ||x  —  x.||2],  Equation  A.2  reduces  to 


$(€.) 


0*-,,,-  ^  (I|x-X.||2r— 

di^X  X*ll2)  ~  ■  drm,9(Z*)  ■ 


d * 

dF*(x) 


<9xM 


x  -  X.  2, 


<  m,  •  m,!  •  max 

C€[0,||x-x.||2] 


dm* 


<9rmt 


i(0 


Thus, 
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for  all  jl  €  K^rrii).  Therefore,  if 


dmi 


1 


- r  •  min  Mi(x)  Vre[0,<5]  , 


C2  ■  rrii  ■  mi!  zeB(£.;S) 


then 


d * 


dx * 


?M*) 


<  ~  •  M,(x) 
C2 


Vx€/*  . 


Thus,  choosing 


C,= 


1 


verifies  the  Lemma  when  d,  >  1.  I 


c2  •  m,  ■  m:! 


A. 2  Error  Function 


Let  M[  —  Ci  ■  min£€fl(f> ;4)  where  Ci  is  the  constant  referred  to  in  Lemma  A.l. 

Let  7(x)  =  7(||x  -  x.||2),  where 

M' 


l  0  ,  for  |r|  > 


-S  <r  <6: 
6. 


For  this  function, 


dr 


m, 


—M[ ,  for  0  <  r  <  6; 

7 (r)  =  ^  +M[ ,  for  —  6  <  r  <  0; 
1  0  ,  for  6  <  |r| 


and 


Define  k,  by1 


da 

^77(0)  =  0  for  0  <  s  <  m,  . 


2, 

Ki  =  J24-l  T, 


if  di  =  1; 

if  di  =  2  or  di  =  3; 


V-^1’  i,d-> 3- 


Then 


/  7 (x)dx  =  7(r)rd'-1 

d/d,  do 


dr 


=  «,  ^  (7 - 1— -)  • 

m,!  \d;  m,  +  di/ 


(A. 3) 


(A. 4) 


^his  form  of  the  definition  of  k,  is  taken  from  [GZ88].  It  is  called  Ck  there. 


A. 3.  CONVERGENT  SEQUENCE 
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A. 3  Convergent  Sequence 


The  function  7  is  not  in  C°°(I^).  All  derivatives  are  discontinuous  at  r  =  6,  and  the  m.th 
order  derivatives  don't  exist  at  r  =  0.  But  it  is  the  limit  of  a  sequence  of  C°°  functions 
that  are  in  G,. 

Lemma  A. 2  There  exists  a  sequence  of  function  {7*}  such  that  each  7*  satisfies 
conditions  1-3  of  Lemma  2.5,  and  for  which  the  sequence 


7*:(x)  dx 


converges  to  fj  7 (x)dx  as  k  — ►  00. 


Proof:  For  0  <  17  <  6/2,  let  77^ (r)  =  (r  -  *7)  *6/(6  —  2  •  17).  Define  the  function  7^  (r) 

^  .  y(Q)  i  for  -17  <  r  <  17; 

7»i(r)  ~  '  f6_-  2  .7(7 /^(r)),  for  17  <  |r|  <  <5  -  17; 


for  6  —  t/\  <  |r|. 


Thus, 

il, 

dr  * 

when  r  €  [— 17,17]  and  0  <  s  <  m„  and 


|t7„W  =  0 


0, 

for  |r|  <  *7; 

+a/;, 

for  — <5  +  i'l  <  r  <  —  *7; 

-Aff, 

for  17  <  r  <  6  —  17; 

0, 

for  |r|  >  6  —  17. 

Let  p(t)  be  a  spherically  symmetric  nonnegative  C°°(9?)  function  whose  support  is  in 
the  interval  (—1,1),  and  whose  integral  over  this  region  is  1.  For  0  <  1/2  <  17,  define 
7*i ,*»(r)  by  mollifying  7n(r)  in  the  following  way,2 


Jr  — 1*2  V 2 

_  r°  -  l^p(ir-t)l1' 2) 

=  /  7*,(<) - - - 

-<  -00  t-7 


2Mollifiers  and  their  attributes  are  discussed  in  [Ric78]. 
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By  integration  by  parts, 


d_ 

dr 


u  2  dra 

for  s  6  {1, . . .  ,m,  —  1}.  Therefore,  7 ^^(r)  is  a  C°°(&)  function  satisfying 


dm'  .  ,  . 


<M[  Vr(E»  , 


^7%i,^(0)  =  0  Vs  >  0  , 


and 


5 


=  0  Vs  >  0  when  |r|  >  6 


Thus,  7n,^(i)  =  7n,»^(lk  —  x.||2)  is  a  member  of  G,  by  Lemma  A.l.  Since  7^,*,(r)  =  0 
if  |r|  >  6  and  7 *fBJ(r)  >  0  if  |r|  <  S,  7^,^(x)  also  satisfies  conditions  2  and  3  of 
Lemma  2.5. 

First  we  show  that  fld  7I/I(x)dx  converges  to  fId  7(1)  dx  as  — ►  0.  Pick  some 

e  >  0.  Then 


/  7„,  (x)  dx  =  Ki-  [  ^(rjr*'1  dr 
Jo 


+  (— •  £"■  i(iM)  • 


*.  •  I  "i*  •  (  - — 7— 1  I  •  7(0) 


m, 


+ 


^  —  2  ■  //j 


m.+l 


I  7(r) 

Jo 


6-2-vr  , 

- - - ■  r  +  i/x  I  dr 


where  /Cj  is  defined  in  equation  A. 3.  Therefore, 


7i/ 


(x)  dx  —  f  7 (x)dx  <  rc.-ft'?’  •  f- — 7—^)  -7 


7(0) 
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Since 


6-  2  •  £/j 


r  +  i/i]  —  rc 


converges  uniformly  to  0  on  the  interval  [0,<5]  as  v\  — »  0,  it  is  clear  that 


/  ”7 i/i  (x)  dx  j  7 {x)dx 

•'Ui  ■'Id, 


converges  to  0  as  u\  — ►  0.  Thus,  there  exists  a  i/i>m  such  that 


j  Ju  luy{x)dx  -  7 {x)dx  <  | 


(A. 5) 


when  Vi  <  V\ 

Next  we  show  that  J[d  7 l/1<l^(x)dx  converges  to  fId  y^(x)dx  as  i/2  —*  v\.  The 
difference  between  the  two  integrals  is  bounded  by 

7m  {x)dx-Ji  7  uu*{x)dx  =  Hi  •  |jf  ^(r)  r^-1  dr  -  jf  ^^(r)  rd,-1dr 

<  ««  /  l7m(r)-7m,^(r)|-rli,'1<fr 

./O 

<  '  J_s  l7m(r)  —  7m,io(r)l  ‘  r<i’-1  dr 

=  Ki-  j  |7m(r)  ~  7m,m(r)l  *  rd,_ldr  . 

J  —  OO 


Let  j| /|| 2  be  the  Z,2  norm  in  3?,  i.e. 


= (rj2{r)drY 


Define  the  function  I#  by 


By  the  Schwartz  inequality, 


I«(r)  = 


1 ,  for  |r|  <  S\ 
0 ,  for  |r|  >  8. 


/OO 

l7m(r)  -  7m,m(r)l  '  rd'~x  dr  <  k,  • 

•OO 


7m, Jh  ■  ||i«  ’  rd' 


'  6(2d'  l)  '  K>  '  Il7m  “  7m  ■ 
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By  Theorem  2  on  page  94  of  Richtmeyer  [Ric78],  the  mollified  function  ^^(r)  converges 
to  7^  in  the  L 2  norm  as  u 2  — »  0.  Thus,  there  exists  a  v2,»  such  that 

11-  -  ,|  .  <  2di-l  1 

I'7"1  ~  2  2 6^'7X 

when  i/2  <  j/2)„  and 

Jh  7*{x)dx- 7^...^..(i)<fx  <  |  .  (A. 6) 

By  equations  A. 5  and  A.6, 

/  7 (x)dx-f  7^..,o2Jx)dx 
JI*.  Jidi 

Since  t  was  arbitrary,  this  implies  that  there  exists  a  sequence  {(17 ,»,V2,i)}  such  that 

/,  -»  /  7(x)dx 

as  i  — »  00.  I 


<  e  . 


A.4  Proof  of  Lemma 

The  proof  of  Lemma  2.5  follows  directly  from  Lemma  A. 2. 

Proof:  By  Lemma  A.2,  there  exists  a  function  7^  ,,^ ,  satisfying  conditions  1-3,  and  whose 
integral  is  at  least  half  as  large  as  the  integral  of  7.  Thus,  by  inequality  A.4, 

1  M'  ( 1 


/  dx  >  I  •  «. .  _i  .  Ti - .  s 

■Hd,  2  mi!  \di  rrii  +  diJ 


m,+d, 


.  Ci  ■  min  Mx{x) 

1  xeS(x.;«)  '  (  1 


2  K* 


m,-! 


(t - “T>) 

Vdi  m,  +  d,/ 


Therefore,  the  bound  in  the  Lemma  exists  for 

If  dx  =  1,  then  «,  =  2  and  Cx  —  l/c2,  where  c2  was  defined  in  inequality  A.l.  In  this 


C«  =  —  •  —7  •  (l - l—r) 

c2  m,!  \  m,  +  1  / 


case. 
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If  di  >  1,  then  Ci  =  (c2  •  m,  •  m,!)  1  and 

d{  mi  -(-  di ) 

I 


Ci  = 


Ki 


1 


2  •  Ci  rrii  ■  m,!  •  m,! 
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Appendix  B 
Covering  Lemma 


This  appendix  describes  the  proof  of  Lemma  2.9.  To  recap,  there  is  a  set  of  n  sampling 
locations,  W,  in  the  open  ball  B{x,\6).  These  are  used  to  generate  a  Voronoi  diagram 
as  in  Section  2.1.1.  A  function  is  constructed  within  the  ball  that  is  zero  at  all  of  the 
interior  sample  locations  and  on  the  boundary  of  the  ball.  The  construction  partitions  the 
sample  locations  into  sets  {S(u/)}  satisfying  the  condition 

S{w)  =  {w  |  || -  u>||2  <  3 r(w')  where  \\y&  -  u>'||2  =  r(tZ>/)} 

for  a  subset  W'  of  W  satisfying 

r(w')  >  r(w)  Vu>  6  S(w') ,  Vt v'  6  W'  . 

The  statement  of  the  Lemma  is  the  following: 

B(x.;6)  C  (J  fl(yc«;5r(u>'))  . 

B.l  Covering  Interior  Voronoi  Cells 

Lemma  B.l  The  closure  of  the  union  of  all  Voronoi  cells  whose  centers  are  in  W  is 
contained  in 

U  B[yv,\r[w'))  . 

w’ew 
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Proof:  The  construction  of  T  is  finished  when  W  is  empty.  Therefore,  all  locations  in  W 
are  in  one  of  the  sets  S(w')  for  w'  6  W'.  If  w  e  S(w'),  then 

lly^  -  wh  <  3 r(w')  . 

No  point  on  the  boundary  of  a  Voronoi  cell  V(w)  is  farther  away  than  r(ti)  from 
w.  Therefore,  the  closed  ball  B(w\r(w))  contains  the  closure  of  the  V(u>).  Denote  the 
closure  of  V(tD)  by  V^tD).  If  x  E  B(w\r(w))  and  w  €  S(tD'),  then 

||s/w'  —  i|h  <  \\y&  ~  +  ||u>  ~  *lh 

<  3 r(w)  +  r(w)  . 

Since  w  €  S(w')  was  removed  from  W  after  w'  was  added  to  W' ,  r(u>)  <  r(tD')  and 

^(rD)  C  B(w;r(w)  C  .  (B.l) 

Therefore,  4r(tu'))  contains  the  closure  of 


U  v«  . 

wZSiw1) 

and 

U  B{y&-,4r{w')) 

u )'€W' 

contains  the  closure  of  all  Voronoi  cells  with  centers  in  the  interior.  I 


B.2  Covering  Boundary  Voronoi  Cells 

Lemma  B.2  The  closure  of  the  union  of  Voronoi  cells  with  centers  on  the  boundary 
of  B(x.\ 6)  is  contained  in 

(J  B(ytf\5r{w'))  . 

w'ew 


Proof:  Every  location  on  the  boundary  of  B(x.',6)  is  a  Voronoi  cell  center.  Each  cell  is 
a  line  segment  with  one  end  in  the  interior  of  B(x.]6)  and  one  end  at  the  cell  center. 
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Since  all  points  in  B(x.;S)  except  xm  have  a  unique  nearest  point  on  the  boundary,  a  cell 
center  x'  £  dB(x.\6)  has  no  neighbors  on  dB(x.;6)  if  the  interior  endpoint  of  V(x')  is 
not  xm.  But  any  interior  sample  location  is  closer  to  x.  than  a  location  on  the  boundary. 
Therefore,  x,  is  contained  in  the  closure  of  some  interior  Voronoi  cell,  and  every  location 
in  dB(x.;6)  is  a  neighbor  of  some  interior  sample  location. 

All  points  on  the  boundary  of  a  Voronoi  cell  V(w)  are  halfway  between  w  and  its 
neighboring  sample  locations.  Since  r(tD)  is  the  maximum  distance  from  w  to  dV(w), 
the  closed  ball  B(ur,2r(w))  contains  ail  neighbors  of  w.  If  a  boundary  location  x'  is  a 
neighbor  of  an  interior  sample  location  tu,  then  B(w\  2r(u?))  contains  i'.  It  also  contains 
the  point  that  is  on  the  boundary  of  both  V(u>)  and  V(x').  Since  the  closure  of  V(x') 
is  the  line  segment  connecting  this  boundary  point  with  x' ,  and  since  B(u>;2r(w))  is  a 
convex  set, 

V(x')  C  B(u>,2r{w))  .  (B.2) 

As  before,  r(tD')  >  r(u))  for  all  to  £  S(w').  By  expression  B.l,  B(i/u>«;  4r(u>'))  contains 
B(w-,r(w))  for  every  to  £  S^to').  Therefore, 

B(u>]  2r(u>))  C  5r(to')) 

for  every  to  £  5(to')-  Thus,  by  expression  B.2,  B(y&<-,  5r(to'))  contains  the  closure  of 
all  Voronoi  cells  with  centers  on  the  boundary  that  have  a  neighbor  among  the  sampling 
locations  in  S(to').  Since  all  boundary  locations  are  neighbors  to  at  least  one  interior 
sampling  location, 

tj  B{y&;5r{w)) 
w'ew 

contains  the  closure  of  all  Voronoi  cells  with  centers  on  the  boundary.  I 


B.3  Proof  of  Lemma 

Proof  of  Lemma  2.9:  Every  point  in  B(x.-,6)  is  in  the  closure  of 

u  ■ 

i'€dB(x.;6)uW 
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By  Lemmas  B.l  and  B.2, 

U  B{yu,';5r{w)) 

w'ew 

contains  the  closure  of  these  same  Voronoi  cells.  Therefore 


B(x,;6)C  U  5(y**;5r(u;')) 

w'ew 


Appendix  C 

Bound  on  Interpolation  Error 

In  this  appendix  we  prove  Lemma  2.12.  Let  6  be  a  positive  constant.  Let  x .  be  a  location 

in  3 Let  h  —  6/(m  +  1).  Let 

X(s,k)  -  (**)*  -  -  +  kh 

for  1  <  s  <  di,  where  (x.),  is  the  sth  component  of  x..  Let  tt,  be  the  following  set  of 
equally  spaced  locations  in  [(x«),  —  8/ 2,  (x.),  +  6/2], 

{*(.,1),-  •  •  • 

Define  tt(/)  to  be  the  following  subset  of  3 

7 r(/)  =  7 Tj  X  •  •  •  X  7T| 

=  {*!(*).  €  7rs,  s  G  {1,.  ..,/}}  .  (C.l) 

Then  7r(d)  is  a  set  of  equidistributed  locations  in  the  d  dimensional  cube  C{x.\6). 
Lemma  2.12  can  be  restated  in  the  following  form: 

Let  g  be  a  function  defined  in  a  d  dimensional  cube  C(xm;8).  Define  tt  to  be 
the  equidistributed  locations  in  C(x.;  6)  described  in  equation  C.l.  Then  there 
exists  a  unique  polynomial  $r(x)  of  degree  at  most  m  —  1  in  each  variable  that 
interpolates  g  at  the  locations  in  i r(d).  If  g  has  mth  order  continous  partial 
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derivative  in  C(x,;6),  then  the  error  in  the  approximation  is  bounded  from 
above  by  a  term  of  the  form 


Cd  •  6m  •  max  max 
zeC(£.;6)  aeKd(m) 


where  Cd  is  independent  of  6  and  x„. 


C.l  Proof  of  Lemma 


Proof  of  Lemma  2.12:  Existence  and  uniqueness  is  proven  in  Prenter  [Pre75,  pages  118— 
121]  for  d  <  2.  The  proof  for  d  >  2  is  a  straightforward  generalization.  The  error  bound 
for  d  =  1  is  also  a  standard  result  [Pre75]  [Atk78].  To  establish  the  error  bound  for  d  >  2 
we  modify  an  argument  in  Prenter  [Pre75,  pages  123-125]. 

Assume  that  y  G  7r(/)  for  some  /  >  0.  Let  la,(y),{x)  be  the  unique  (m  —  l)st  degree 
polynomial  in  one  variable  such  that 


/  /  v  fL  =  (£)*; 

W(*)-|0>  ifl€7rjj2 


*  #  (y)s- 

when  s  <  l.  Then  a  polynomial  interpolant  to  a  function  g  in  at  the  locations  in  tt {d) 
is 

_  /  A.  \ 

(C-2) 


{Idg){x')  =  £  (g{y)  •  f[  *.,(v).((x')») 

y€r(<i)  \  *=1 


This  is  the  unique  polynomial  interpolant  of  degree  less  than  or  equal  to  m  —  1  in  each 
variable.  We  will  refer  to  this  as  the  d  dimensional  LaGrange  interpolating  polynomial.  The 
locations  in  ir(d—  1)  also  specify  an  interpolating  polynomial  on  the  (d  —  1)  dimensional 
hyperplane  ( x)d  =  z, 


{Id-1  (z)g)(x')=  £  (g{y,z)-  n /.,(»). ((*').) 

v€*(a-i)  V  *=i 

It  is  the  unique  interpolating  polynomial  of  degree  less  than  or  equal  to  m  —  1  in  the  first 
d  —  1  variables,  and  of  degree  zero  in  the  dth  variable. 

Assume  that  the  error  bound  holds  for  d  —  1.  The  interpolation  error  at  a  location 
x'  G  C(x,\ 6)  is  bounded  by 


W)  -  {hg){x')\  <  | g(x')  -  (Id-i({x')d)g)(x')\  +  | (it/.i((x%)g)(sf)  -  (Idg)(x') |  . 


C.l.  PROOF  OF  LEMMA 
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The  first  term  is  the  error  of  an  approximation  to  g  in  the  hyperplane  (x)j  =  ( x')d.  The 
second  term  is  the  difference  between  the  approximation  using  data  in  the  hyperplane  and 
the  approximation  using  the  given  sampling  locations. 

(Idg)(x')  can  be  rewritten  as 

Y  f  (  Y  9{V ,  w)  •  /^((xOd))  •  (  f[  ((*')*)) 

jj€*(d-l)  L  \w£ird  J  \4=1  /  . 

Therefore, 

|(/,_1((x')d)<7)(x0-(/<i5){x')| 

<  Y  (W  (*')*)  -  Y  9iy,u>)  •  id,w{{x')d)  \  •  (n  *..<*).((*')») 

y€*(d~  1)  L  V  “'€»<!  /  V»=l 

(C.3) 


Since 


Y  S(y**>)  ■  ld,v>((*')d) 

««€if d 


is  the  univariate  interpolant  to  g(y,  (x')<t), 


9{y .  (*')d)  ~  Y  9{y ,  «>)  •  ld,»((*')d) 


<  C,  •  6” 


max 

*€[(*.  ).-«/2,(i.).+«/2] 


m\ 


r  9{x) 


(C.4) 

for  each  y  €  7r(d  —  1).  The  following  bound  is  proven  by  an  argument  described  on  page 
41  of  Prenter  [Pre75]; 

Y  max  £  2m 

fc=l  v,€lr(« ,0)>*(j,m+J)J 

for  all  s  €  {1, . . .  ,d}.  Therefore, 

y  n  =  y  fn  •  (  y  id-i,w((*')d-i) ) ) 

fi€*(rf-l)*=l  j/€*(d-2)  \*=1  \  // 

<  2-.  y  ffw(*0.) 

y€*(d- 2)  *=1 


<  2m(d~l)  .  (C.5) 
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APPENDIX  C.  BOUND  ON  INTERPOLATION  ERROR 


Combining  inequalities  C.4  and  C.5  allows  us  to  replace  inequality  C.3  by 

l(/j-.((i’W9)(i')  -  (/w«)(x')|  <  •  C,  •  S~  ■  max  .  (C.6) 

i6C(r.;«)  0{X)d 

The  function  (Id-i{{x ')d)g)(x')  is  an  approximation  to  g  in  a  (d  —  1)  dimensional  cube 
embedded  in  the  hyperplane  xd  =  x'd,  Therefore,  by  assumption,  there  exists  a  constant 
Cd-i  such  that 

l$(*')  -  (/d-i((i/)d)flr)(x')|  <  Cd-i-P*-  max  ma x  -^g(x) 

x€C(x.\6)  OX * 

(*)-=(*')<< 

<  Cj_  i  •  <5m  •  max  max 

r6(?(r.;5)  jMzKa- 1  (m) 

The  bound  on  the  error  in  the  approximation  is  then 

Ij(i')  -  (/*)(*')!  <  + 

by  inequalities  C.7  and  C.6.  The  Lemma  is  proven  for  d  dimensional  cubes  when 

Cd  =  2m(d-1)  •  Ci  +  Cd-i  . 

The  lemma  follows  for  all  d  >  0  by  induction.  I 


* 
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